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ABSTRACT 


This  report  describes  in  detail  our  studies  in  developing  and  evaluating  sparse 
linear  systems  on  scalable  architectures,  with  emphasis  on  preconditioned  iter¬ 
ative  solvers.  The  study  was  motivated  primarily  by  the  lack  of  robustness  of 
Krylov  subspace  iterative  schemes  with  generic,  “black-box”,  preconditioners 
such  as  approximate  (or  incomplete)  LU-factorizations  as  well  as  their  limited 
scalability  to  large-scale  parallel  platforms.  In  this  report  we  advocate  the  use 
of  banded  preconditioners  after  suitable  reordering  of  the  sparse  linear  systems. 
The  choice  of  the  reordering  scheme  is  based  on:  (i)  minimizing  the  bandwidth, 
and  (ii)  bringing  as  many  of  the  largest  elements  of  the  coefficient  matrix  as 
possible  to  a  “narrow”  central  band.  Next,  we  extract  a  prominent  central 
band  and  use  it  as  a  preconditioner.  In  Chapters  1  and  2  we  develop  a  paral¬ 
lel  algorithm,  “SPIKE”,  for  solving  banded  systems  that  are  considered  dense 
within  the  band.  We  also  show  that  our  solver  is  highly  scalable  and  superior 
in  performance  to  the  banded  system  solver  in  the  widely  available  parallel  li¬ 
brary  ScaLapack.  Also,  we  show  how  ’’SPIKE”  can  be  used  in  conjunction  with 
a  direct  sparse  system  solver  such  as  “Pardiso”,  which  is  a  subroutine  in  In¬ 
tel’s  Math  Kernel  Library  (MKL)  for  handling  banded  systems  that  are  sparse 
within  the  band.  In  Chapter  3,  we  present  a  weighted  reordering  scheme  that 
enables  the  extraction  of  effective  banded  preconditioners.  We  also  show  that 
the  time  consumed  by  such  a  reordering  scheme  is  a  small  fraction  of  the  total 
time  needed  to  solve  very  large  sparse  systems  using  preconditioned  BiCGstab 
or  GMRES  (two  prominent  members  of  the  Krylov  subspace  methods).  The 
effectiveness  of  our  strategy  is  demonstrated  on  a  wide  collection  of  publicly 
available  sparse  linear  systems  at  the  University  of  Florida  (see  Tim  Davis’ 
web-site).  In  Chapter  4,  we  describe  an  alternate  scalable  parallel  algorithm  for 
handling  banded  preconditioners  that  arise  from  domain  decomposition  in  which 
the  domains  overlap.  We  demonstrate  the  effectiveness  of  this  solver  compared 
to  direct  solvers  such  as  ScaLapack  or  preconditioned  iterative  solvers  for  both 
symmetric  positive  definite  systems  as  well  as  non-symmetric  systems.  Finally, 
in  Chapter  5,  we  present  our  study  concerning  prediction  of  parallel  scalability 
of  our  schemes  on  architectures  with  more  nodes/cores  than  the  platforms  on 
which  our  experiments  have  been  performed. 


Chapter  1 


Highly  Scalable  Linear 
Solvers:  The  SPIKE 
Algorithm 


1.1  Introduction 

Banded  linear  systems  arise  in  many  areas  of  computational  science  and  engi¬ 
neering  such  as  computational  mechanics  (fluids,  structures,  and  fluid-structure 
interaction)  [69,  71]  and  computational  nanoelectronics  [59].  These  applications 
often  give  rise  to  very  large  narrow  banded  linear  systems,  which  can  be  dense  or 
sparse  within  the  band.  For  example,  in  finite-element  analysis,  the  underlying 
sparse  linear  systems  can  be  reordered  to  result  in  a  banded  system  in  which 
the  width  of  the  band  is  a  small  fraction  of  the  number  of  unknowns. 

In  this  report,  we  describe  important  algorithmic  and  performance-related 
aspects  of  Spike,  [9,  10,  33,  51,  65,  66,  67,  68,  70].  The  underlying  basis  for 
Spike  is  a  divide  and  conquer  technique,  which  involves  the  following  stages: 
(a)  pre-processing:  (i)  partitioning  of  the  original  system  on  different  proces¬ 
sors,  or  different  Symmetric  Multiprocessors  (SMP’s),  (ii)  factorization  of  each 
diagonal  block  and  extraction  of  a  reduced  system  of  much  smaller  size;  (b) 
post-processing:  (iii)  solving  the  reduced  system,  and  (iv)  retrieving  the  overall 
solution.  Not  only  does  SPIKE  enable  multilevel  parallelism,  but  it  also  allows 
multiple  choices  for  the  pre-  and  post-processing  stages,  resulting  in  a  poly¬ 
algorithm.  We  will  show  that  this  algorithm  has  several  built-in  options  that 
range  from  using  it  as  a  pure  direct  solver  to  using  it  as  a  preconditioner  for 
any  iterative  scheme. 

We  present  several  numerical  experiments  that  demonstrate  significant  speed 
and  scalability  improvements  over  corresponding  routines  in  ScaLapack  [12]  for 
handling  large  banded  systems  on  a  high-end  computing  platform. 

A  general  description  of  the  SPIKE  algorithm  is  presented  in  Section  1.2. 


1 


1 


Figure  1.1:  Partitioning  of  the  matrix  A  and  the  RHS  F,  with  p  =  4.  The  size 
of  each  partition  j  is  nj,  and  the  size  of  the  coupling  off-diagonal  blocks  Bj  and 
Cj  of  the  original  matrix,  is  m  x  m. 


The  hybrid  and  poly-algorithm  nature  of  SPIKE  is  discussed  in  Section  1.3 
along  with  different  options  for  the  treatment  of  the  pre-  and  post-processing 
stages.  In  particular,  Section  1.4  and  1.5  focus  on  the  detailed  description  of 
two  versions  of  the  algorithm:  the  “Recursive”  and  the  “Truncated”  schemes, 
for  handling  general  and  diagonally  dominant  systems,  respectively.  In  Section 
1.6,  we  present  several  numerical  experiments  with  performance  comparisons 
versus  ScaLapack  on  the  computing  platform  IBM-SP. 


1.2  The  Spike  algorithm 

Consider  solving  the  linear  systems  AX  =  F,  where  A  is  a  narrow  banded  nxn 
matrix,  and  F  is  the  nx  s  multiple  right  hand  side  (RHS) .  We  assume  that  the 
bandwidth  b  is  much  less  than  the  system  order  n. 

1.2.1  Preprocessing  stage 
Partitioning 

Any  banded  linear  system  can  be  partitioned  into  a  block  tridiagonal  form. 
If  p  is  the  number  of  diagonal  blocks,  then  each  banded  diagonal  block  Aj 
(j  =  1 , ,p),  is  of  order  n.j  (or  roughly  of  order  n/p).  For  a  given  partition 
j,  Bj  ( j  =  1),  and  Cj  ( j  =  2 ,...,p),  are  the  coupling  matrices 

associated  with  the  first  super-  and  sub-diagonal  blocks,  respectively.  Each  is 
of  order  m  «  nj.  Figure  1.1  illustrates  the  partitioning  of  the  matrix  A  and 
the  RHS  for  p  =  4. 
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Figure  1.2:  The  spike  factorization  defined  by  A  =  DS,  where  D  is  a  block- 
diagonal  matrix  with  4  partitions.  A  new  linear  system  SX  =  G  needs,  then,  to 
be  solved,  where  G  is  the  modified  right  hand  side  (DG)  =  F)).  An  independent 
reduced  system,  of  much  smaller  size,  can  be  extracted  from  those  few  rows  of  S 
immediately  above  and  below  each  partitioning  line. 


Each  partition  can  be  associated  with  one  or  several  processors  (one  node), 
enabling  multilevel  parallelism.  For  structurally  symmetric  problems,  m  is  equal 
to  ( b  —  1) /2,  while  for  non-symmetric  structure,  for  coding  convenience,  we 
include  some  zero  elements  either  in  Bj  or  Cj  to  maintain  structural  symmetry, 
resulting  in  m  >  {b  —  l)/2. 

1.2.2  Factorization 

The  preprocessing  stage  includes  the  factorization  of  each  diagonal  block,  and 
obtaining  a  new  system  that  results  from  pre-multiplying  each  partition  by  the 
inverse  of  its  diagonal  block.  The  resulting  new  system  has  identity  matrices  as 
diagonal  blocks,  and  spikes  of  m  columns  in  the  immediate  off-diagonal  blocks. 

Assuming,  for  the  time  being,  that  each  Aj  is  nonsingular,  the  matrix  can 
be  factored  as  A  =  D  S,  where  D  is  a  block-diagonal  matrix  consisting  only  of 
the  diagonal  blocks  Aj, 


D  =  diag(Ai, ...,  Ap), 

with  S  being  the  spike  matrix  shown  in  Fig.  1.2.  For  a  given  partition  j,  we 
call  Vj  (j  =  1, . . .  ,p  —  1)  and  Wj  (j  =  2, ...  ,p),  respectively,  the  right  and  the 
left  spikes  each  of  order  nj  x  to. 
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The  spikes  Vj  and  W3  are  given  by 


Vj  =  (A,)-1  \  °  1  Bj,  and  Wj  =  (A,)”1  \  Im  ]  Cj.  (1.1) 

1m  J  \_  v 

The  spikes  Vj  and  Wj,  j  =  2,  ...,p  —  1,  may  be  generated  by  solving, 

'  0  Cj' 

Aj[Vj,Wj]=  '■  °  .  (1.2) 

0  : 

Bj  0 . 

1.2.3  The  Post-processing  stage 

Solving  the  system  AX  =  F  now  reduces  to  two  steps: 

(a)  solve  DG  =  F  (1.3) 

(b)  solve  SX  =  G.  (1.4) 

The  solution  of  the  linear  system  DG  =  F  in  Step  (a),  yields  the  modified  RHS 

G  needed  for  Step  b,  and  in  case  of  one  partition  per  processor,  is  performed 
with  perfect  parallelism.  In  case  we  decouple  the  pre-and  post-processing  stages, 
this  step  may  be  combined  with  the  generation  of  the  spikes  in  equations  (1.2). 

To  solve  SX  =  G  in  Step  b,  one  should  observe  that  the  problem  can  be 
reduced  further  by  solving  an  independent  system  of  much  smaller  size, 

SX  =  G,  (1.5) 

which  consists  of  the  m  rows  of  S  immediately  above  and  below  each  partitioning 
line.  Indeed,  the  spikes  Vj  and  Wj  can  also  be  partitioned  as  follows 

Vj  =  [VV  V'j  P/6)]T,  and  Wj  =  [W}t}  W'  W*jb)]T,  (1.6) 

where  V^\  Vj,  vjb\  and  W^\  Wj,  Wjb\  are  the  top  m,  the  middle  rij  —  2 to 
and  the  bottom  m  rows  of  Vj  and  Wj ,  respectively.  We  note  that 

vjb)  =  [0  Im]Vj ;  wf]  =  [Im  0  ]Wj,  (1.7) 

and 

v^  =  [Im  o  ]Vji  w^b)  =  [  0  Im]Wj.  (1.8) 

Similarly,  if  X3  and  Gj  are  the  jtli  partitions  of  X  and  G,  we  have 

Xj  =  [xf)  Xj  X{b)}T,  and  G3  =  [Gf  Gj  G{b)]T .  (1.9) 

It  is  then  possible  to  extract,  from  (1.4),  the  independent  reduced  linear  system 
(1.5),  which  involves  only  the  top  and  bottom  elements  of  Vj,  Wj,  Xj  and  Gj. 
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This  reduced  system  is  block  tridiagonal,  with  (p  —  1)  diagonal  blocks,  the  Mh 
of  which  is  given  by 


Kh  Im 

The  corresponding  off-diagonal  blocks  are 


(1.10) 


'wf}  o' 
0  0 


and 


'0  0  ' 

o 


and  the  associated  solution  and  RHS  are 


( b ) 


(t)  1 T 
k+  lJ  ’ 


and 


[Gf 


G 


W  i 
fc+iJ 


(1.11) 


Once  the  solution  X  of  the  reduced  system  (1.5)  is  obtained,  the  global 
solution  X  is  reconstructed  with  perfect  parallelism  from  Xj!  (k  =  1, . . .  ,p  —  1) 
and  Xj:  (k  =  2, . .  .,p)  as  follows: 

'  X[  =  G[  -  V{X$\ 

'  X^Gj-V'X^-W'X^,  j  =  2, . . .  ,p  —  1  (1.12) 

,  X'v  =  G’v-W'3X^_x. 


1.3  SPIKE:  a  poly-algorithm 

Several  options  are  available  for  efficient  implementation  of  the  SPIKE  algo¬ 
rithm  on  parallel  architectures,  for  reducing  the  complexity  and  required  stor¬ 
age.  These  choices  depend  on  the  properties  of  the  linear  system,  as  well  as  the 
architecture  at  hand.  More  specifically,  SPIKE  allows  three  major  options: 

1.  factorization  of  the  diagonal  blocks  Aj, 

2.  computation  of  the  spikes,  and 

3.  solution  scheme  for  handling  the  reduced  system. 

In  the  first  option,  each  linear  system  associated  with  Aj ,  can  be  solved 
either  (i)  directly,  making  use  of  Gaussian  elimination  (LU-factorization)  with 
partial  pivoting  or  Cholesky  factorization  if  (Aj)  is  symmetric  positive  defi¬ 
nite,  (ii)  using  LU  factorization  without  pivoting  but  with  a  diagonal  boosting 
strategy,  (iii)  iteratively  with  a  preconditioning  strategy,  (iv)  or  via  appropriate 
approximation  of  the  inverse  of  (Aj).  If  more  than  one  processor  is  associated 
with  one  partition,  another  level  of  the  SPIKE  algorithm  may  also  be  used.  In 
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the  following,  we  consider  only  the  case  where  one  partition  is  associated  with 
only  one  processor. 

In  the  second  option,  the  spikes  can  be  computed  either  explicitly  (fully  or 
partially)  using  equation  (1.2),  or  implicitly  -  “on-the-fly” .  We  will  restrict  the 
treatment  here  to  explicit  partial  or  complete  determination  of  the  spikes. 

In  the  third  option,  the  reduced  system  (1.5)  can  be  solved  either  (i)  directly 
using  a  “recursive”  form  of  the  SPIKE  algorithm,  (ii)  iteratively  with  a  precon¬ 
ditioning  scheme,  or  (iii)  approximately  within  a  “truncated”  SPIKE  scheme 
for  diagonally  dominant  systems.  Both  the  recursive  and  truncated  algorithms 
will  be  presented  in  detail  in  the  following  sections. 

In  addition,  outer  iterations  will  be  necessary  to  assure  sufficient  accuracy 
whenever  we  do  not  use  a  direct  method  to  solve  (1.3)  or  (1.5).  The  overall 
SPIKE  algorithm  is  then  used  as  a  preconditioner  for  the  outer  iterative  scheme 
(Krylov  subspace  scheme,  or  iterative  refinement)  in  which  (1.3)  is  solved  once 
and  for  all. 

A  multitude  of  options  for  SPIKE  have  been  implemented.  However,  a  com¬ 
plete  description  of  implementation  details  is  beyond  the  scope  of  this  report. 
In  the  following  we  describe  two  optimal  algorithms  for  handling  general  and 
diagonally  dominant  systems. 


1.4  The  non-diagonally  dominant  case 

In  this  section,  we  present  a  version  of  the  SPIKE  algorithm  that  is  general 
enough  for  handling  non-diagonally  dominant  systems.  In  particular,  we  con¬ 
sider  LU  factorization  of  the  diagonal  blocks  without  pivoting,  but  with  a  “di¬ 
agonal  boosting”  strategy  to  obtain  an  LU  factorization  of  slightly  perturbed 
counterparts  of  the  original  block.  This  is  followed  by  an  efficient  recursive 
SPIKE  scheme  for  solving  the  resulting  reduced  system. 

1.4.1  Factorization  step 

For  numerical  stability,  one  obtains  the  LU  factorization  of  each  block  Aj  with 
partial  pivoting,  often  by  making  use  of  the  LAPACK  routines  [1]  XGBTRF. 
Here  X  corresponds  to  S,  D,  or  Z,  for  single  real,  double  real,  and  double  complex 
arithmetic.  The  LAPACK  routines  XGBTRS,  based  on  Basic  Linear  Algebra 
Subroutines  BLAS  level-2  [1],  may  be  used  to  solve  for  the  spikes  (1.2)  and 
the  RHS  (1.3).  We  have  modified  this  LAPACK  procedure  in  order  to  take 
advantage  of  BLAS  level-3,  and  used  it  to  solve  triangular  systems  that  arise  in 
SPIKE. 

We  have  also  developed  a  version  of  SPIKE  in  which  the  LU-Factorization  is 
performed  without  pivoting  but  with  a  “diagonal  boosting”  strategy  to  overcome 
problems  associated  with  very  small  pivots.  The  magnitude  of  the  pivot  must 
then  satisfy  the  following  condition: 

I  pivot  |  >  0e  11-AjH, 
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where  0£  is  a  newly  defined  “machine  zero”,  and  ||.||  is  the  1-norm,  for  example. 
If  the  pivot  does  not  satisfy  this  condition,  then  its  value  is  boosted  using  e, 
which  depends  on  the  machine’s  unit  roundoff: 

pivot  =  pivot  +  e|  \Aj  |  if  pivot  >  0, 
pivot  =  pivot  —  e\ \Aj  |  if  pivot  <  0. 

We  have  implemented  a  modified  version  of  the  XDBTRF  routine  (originally 
developed  for  the  ScaLapack  package  to  handle  diagonally  dominant  matrices) 
augmented  by  a  diagonal  boosting  strategy.  If  diagonal  boosting  is  activated, 
SPIKE  is  not  used  as  a  direct  solver  but  rather  as  a  preconditioner.  In  this 
case  outer  iterations  via  a  Krylov  subspace  method,  for  example,  are  activated. 
In  most  cases,  few  outer  iterations  are  needed  to  achieve  adequate  reduction  of 
the  relative  residuals.  Finally,  since  the  LU  factorization  is  performed  without 
pivoting,  our  BLAS3-based  block  triangular  solver  can  be  used  for  both  the 
forward  and  backward  sweeps  for  solving  (1.2)  and  (1.3). 

1.4.2  The  Recursive  SPIKE  algorithm 

One  natural  way  to  solve  the  reduced  system  in  parallel  is  to  make  use  of  a 
Krylov  subspace  iterative  method  with  a  block  J  acobi  preconditioner  defined  by 
the  diagonal  blocks  (1-10)  of  the  reduced  system.  However,  for  non-diagonally 
dominant  systems,  this  preconditioner  may  not  be  effective  for  large  number 
of  partitions  (producing  large  reduced  systems).  This  often  results  in  a  large 
number  of  iterations  if  one  is  to  realize  reasonable  residuals,  and  in  high  inter¬ 
processor  communication  cost.  If  the  unit  cost  of  interprocessor  communication 
is  high,  the  reduced  system  may  be  solved  directly  on  a  single  processor.  This 
alternative,  however,  may  have  memory  limitations  if  the  size  of  the  reduced 
system  is  large. 

We  propose  a  new  direct  scheme  for  solving  the  reduced  system  in  parallel. 
This  “Recursive”  scheme,  involves  successive  iterations  of  the  SPIKE  algorithm 
resulting  in  better  balance  between  the  computational  and  communication  costs. 

Preliminary:  The  two-partitions  case 

Using  only  two  partitions,  p  =  2,  one  can  observe  that  the  reduced  system, 
which  consists  now  of  only  one  diagonal  block  (1.10),  needs  only  to  be  solved 
directly.  This  reduced  system  can  be  extracted  from  the  central  part  of  the 
system  (1.4) 


Im 

v™' 

'xf 

-G(b)~ 

w2(t) 

Im 

x$\ 

<$\ 

The  solution  steps  consist  of  the  following: 
•  Form  E  =  Im  -  W^v}b) 


(1.13) 
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•  Solve  EX <t}  =  G{2]  -  W^G{b)  to  obtain 

•  Compute  Xf}  =  G{b)  -  v}b)X 

The  rest  of  the  solution  of  X\,  and  X2  can  be  retrieved  using  (1.12). 

Recursive  SPIKE:  the  multiple  partitions  case 

Here,  we  assume  that  the  number  of  partitions  is  given  by  p  =  2d  (\/d  >  1). 
After  forming  the  spike  matrix  S  (see  Fig.  1.2),  the  number  of  partitions  of  the 
new  linear  system  (1.4)  can  be  divided  by  two,  and  another  level  of  the  SPIKE 
algorithm  may  be  applied.  This  process  is  repeated  recursively  until  we  obtain 
only  two  partitions  for  the  newest  matrix  S.  The  resulting  reduced  system  has 
the  form  (1.13). 

In  our  implementation,  the  Recursive  scheme  is  not  concerned  with  the  over¬ 
all  matrix  S  but  rather  with  the  matrix  S  of  the  reduced  system  (1.5)  itself.  This 
allows  us  to  simplify  the  implementation,  and  reduce  the  memory  requirements 
while  saving  all  the  different  levels  of  the  new  spikes.  Note  that  in  the  reduced 
system  (1.5),  the  matrix  S  is  block  tridiagonal,  where  the  diagonal  blocks  are 
defined  in  (1.10)  and  the  off  diagonal  blocks  are  as  given  in  (1.11). 

Observing  that  we  can  also  extract  an  independent  reduced  system  of  order 
2 mp  rather  than  order  2 m(p—  1),  if  we  include,  in  addition,  the  top  m  rows  and 
the  bottom  m  rows  of  the  first  and  last  partitions,  respectively.  In  this  case, 
the  structure  of  the  reduced  system  remains  block  tridiagonal  in  which  each 
diagonal  block  is  an  identity  matrix  of  order  2m,  and  the  off-diagonal  blocks 
associated  with  the  fc-th  diagonal  block  are  given  by, 


0 

I4(t) 

0 

0 

Wih\ 

for  k  =  2, . . 

•  ,P  , 

and 

v  k 

0 

(1.14) 

Denoting  the  spikes  of  the  new  reduced  system  at  level  1  of  the  recursion  by 
and  where 

411  =  [^(t)  Vkb)]T >  and  w%]  =  [WP  Wff,  (1.15) 


the  matrix  Si  of  the  new  reduced  system,  for  p  =  4  takes  the  form: 


L  _  _  _  J 


right  spikes:  k  =  1,  2, 3 

left  spikes:  k  =  2,3,4. 


In  preparing  for  level  2  of  the  recursion  of  the  Spike  algorithm,  we  partition 
the  matrix  Si  using  p/2  partitions  each  of  size  4?n.  The  matrix  Si  can  then  be 
factored  as 


Si  =  Di  S2 

where  D\  is  formed  by  the  p/2  diagonal  block  of  Si  each  of  size  4m,  thus  S2 

[21 

represents  the  new  Spike  matrix  at  the  level  2  composed  of  the  spikes  v/  and 
[21 

'w\.  .  For  p  =  4,  these  matrices  are  of  the  form, 


Di 


I 

0 


0 

I 


I 


0 


0 


I 


right  spikes:  k  =  1,  3 

left  spikes:  k  =  2,4 


and 
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right  spikes:  ;  k  =  1 

left  spikes:  w^-,  k  =  2. 


In  general,  at  level  i  of  the  recursion,  the  spikes  ,  and  with  fc  ranging 
from  1  to  p/(2l),  are  of  order  2lm  x  ?n.  Thus,  if  the  number  of  the  original 
partitions  p  is  equal  to  2d,  the  total  number  of  recursion  levels  is  c?  —  1  and  the 
matrix  Si  can  be  expressed  in  the  factored  form, 

Si  =  D!D2  •  •  •  Dd_!  Sd 

where  the  matrix  Sd  has  only  two  spikes  and  vjl^ .  The  reduced  system  can 
then  be  written  as 

SdX  =  B,  (1.16) 

where  B  is  the  modified  right  hand  side: 

B  =  D-11...D21D-1  G.  (1.17) 

If  we  assume  that  the  spikes  v\^ ,  and  vj^)  (Vfc)  of  the  matrix  S,  are  known  at  a 
given  level  i,  then  we  can  compute  the  spikes  ,  and  rc[,i+1^  at  level  i  +  1  as 
follows: 


Step  1 .  Denoting  the  bottom  and  the  top  blocks  of  the  spikes  at  the  level  i  by 

4'1(6)  =  [°  Wk'{t)  =  Vm  0]41 


and  the  middle  block  of  2m  rows  of  the  spike  at  level  i  +  1  by 


k 

,[i+l] 


=  [0  hm  0], 


b+i]. 


w 


w 


[i+1] 

k 

[i+l] 


=  [0  I 2  m  0]l 


,[<+!] 


one  can  form  the  following  reduced  systems 


w. 


m 

2k 


„[*](*>) 

J2k-1 


r  -  [i+iii 

vk 

'  0  ' 

K  J 

jm 

lu2k  J 

Vfc=  1,2,. 


'  ’  2 i-1 


-  1, 


(1.18) 
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and 


u2k 


„H(b) 

u2k-l 


r  •  b-i-iii 

L[il(6)l 

w2k-l 

••[i+l] 

K  J 

0 

Vfc  =  2,3,.... 


(1.19) 


These  reduced  systems  are  solved  in  a  manner  similar  to  (1.13)  to  obtain 
the  solutions  of  the  center  parts  of  the  spikes  at  the  level  i  +  1. 


Step  2  The  entire  solution  of  the  spikes  at  the  level  i  +  1  is  retrieved  as  follows 


l12 'm  UJ  vk  ~  V2k-lVk  > 


[U  J-2’-m\vk  —  v2  k  W2kVk  > 


(1.20) 


and 


[J2 *m  °]wfe  +  l1  =  w2k-l  -  4fc-l*fe  +  l1  [°  ^2‘m]w{!+11  =  -W^wil+11. 

(1.21) 


In  order  to  compute  one  step  of  the  modified  RHS  (1.17)  as  Gi  =  D;  1Gi_i 
(with  the  first  being  Gi  =  D1(1G),  one  has  to  solve  the  linear  system 
Di  Gi  =  Gi_i,  which  is  block  diagonal.  For  each  diagonal  block  k,  the  re¬ 
duced  systems  are  similar  to  those  in  (1.13)  or  in  (1.18)  and  (1.19)  but  the  RHS 
is  now  defined  as  function  of  Gi_i.  Once  we  obtain  the  partial  solution  at  the 
center  part  of  each  Gi,  associated  with  each  block  k  in  Di,  the  entire  solution 
is  retrieved  as  in  (1.20)  and  (1.21).  In  the  same  way,  the  linear  system  (1.16) 
involves  only  one  reduced  system  to  solve  and  only  one  retrieval  stage  to  get 
the  solution  X.  Finally,  the  overall  solution  X  is  obtained  using  the  procedure 
(1.12). 


1.4.3  Additional  remarks 

The  generation  of  the  spikes  at  the  various  levels  is  included  in  the  factorization 
step.  In  this  way,  the  solver  makes  use  of  the  spikes  stored  in  the  memory, 
thus  allowing  solution  of  the  reduced  system  quickly  and  efficiently.  Since  in 
many  applications  one  has  to  solve  many  linear  systems  with  the  same  coefficient 
matrix  A  but  with  different  RHSs,  optimization  of  the  solver  step  is  thus  crucial 
for  the  success  of  SPIKE. 


1.5  The  diagonally  dominant  case 

In  this  section,  we  describe  a  version  of  the  SPIKE  algorithm  that  is  optimized 
for  handling  diagonally  dominant  systems.  A  truncated  scheme  is  proposed 
that  yields  a  modified  reduced  system  that  is  block  diagonal  rather  than  block 
tridiagonal.  This  allows  solving  the  reduced  system  more  efficiently  on  parallel 
architectures.  Furthermore,  it  facilitates  different  new  options  for  the  factoriza¬ 
tion  steps  that  make  it  possible  to  avoid  the  computation  of  the  entire  spikes. 
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1.5.1  The  truncated  SPIKE  algorithm 

If  the  matrix  A}  is  diagonally  dominant,  one  can  show  that  the  magnitude  of 
the  elements  of  the  right  spikes  V3  decay  in  magnitude  from  bottom  to  top, 
while  the  elements  of  the  left  spikes  Wj  decay  from  top  to  bottom.  Since  the 
size  n  of  Aj  is  much  larger  than  the  size  m  of  the  blocks  Bj  and  Cj ,  the  bottom 
blocks  of  the  left  spikes  Wjb)  and  the  top  blocks  of  the  right  spikes  can  be 
approximately  set  equal  to  zero.  Thus,  it  follows  that  the  off-diagonal  blocks 
of  reduced  system  (1.11)  are  equal  to  zero,  and  the  reduced  matrix  S  is  then 
approximated  by  its  block  diagonal  (1.10).  Each  diagonal  block  of  this  truncated 
reduced  system  can  be  solved  directly  using  the  same  procedure  described  for 
solving  the  system  (1.13),  leading  to  enhanced  use  of  parallelism. 

1.5.2  The  truncated  factorization  stage 

Since  the  matrix  is  diagonally  dominant,  the  LU  factorization  of  each  block  Aj 
can  be  performed  without  pivoting,  using  for  example  the  modified  LAPACK 
routine  XDBTRF  in  the  ScaLapack  package  (no  boosting  strategy  is  necessary). 
Using  an  LU-factorization  without  pivoting,  one  can  get  the  bottom  block  of  Vj 
from  (1.1)  involving  only  the  bottom  m  x  m  blocks  of  L  and  U.  Obtaining  the 
top  block  of  Wj  still  requires  computing  the  entire  left  j-th  spike  with  complete 
forward  and  backward  sweeps.  Another  approach  consists  of  performing  a  UL- 
factorization  (Gaussian  elimination  that  produces  the  factorization  in  the  form 
of  the  product  of  an  upper  triangular  matrix  and  a  lower  triangular  matrix) 
without  pivoting.  Similar  to  the  LU-factorization,  this  allows  obtaining  the  top 
block  of  Wj  involving  only  the  top  mxm  blocks  of  the  new  U,  L.  Our  numerical 
experiments  indicate  that  the  time  consumed  by  this  LU/UL  strategy,  is  much 
less  than  that  taken  by  performing  only  one  LU  factorization  per  diagonal  block 
and  generating  the  entire  left  spikes.  Using  a  permutation  of  the  rows  and 
columns  of  the  original  matrix,  we  can  use  the  same  LU  factorization  routine 
to  perform  the  UL-factorization. 

Approximation  of  the  factorization  stage 

An  alternate  to  performing  a  UL-factorization  of  the  block  Aj  is  to  approximate 
the  top  of  the  left  spike,  W^\  of  order  m,  by  inverting  only  an  l  xl  (l  >  m)  top 
diagonal  block  (left  top  corner)  of  the  banded  block  Aj.  Typically,  we  choose 
l  =  2m  to  get  a  suitable  approximation  of  the  m  —  by  —  m  left  top  corner  of 
the  inverse  of  Aj.  However,  the  quality  of  this  approximation  depends  on  the 
degree  of  diagonal  dominance.  With  such  a  strategy,  SPIKE  thus  needs  to  be 
used  as  a  preconditioner.  In  general,  convergence  of  an  outer  Krylov  subspace 
iterative  scheme  is  often  realized  after  only  few  iterations. 

Using  the  above  Truncated  scheme,  the  entire  spikes  are  not  computed  explicitly. 
Therefore,  once  the  reduced  system  is  solved,  retrieving  of  the  entire  solution 
cannot  be  done  using  equation  (1.12).  Rather,  we  extract  the  RHS  from  the 
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contributions  of  the  tying  blocks  Bj  and  Cj,  and  then  solve  the  resulting  block 
diagonal  system  using  either  the  previously  computed  LU  or  UL  factorizations, 
as  shown  below: 


A,  Xl  =  Fi  — 


A?  Xj  —  Fj 


Ap  Xp  —  Fp 


TD 

^jA2  5 


TD 


SD  v(b) 

^j^p- 1- 


CjXf\,  j  =  2, . . .  ,p  —  1  (1.22) 


1.6  Performance  results  and  comparisons  with 
ScaLapack 

The  performance  of  different  versions  of  the  SPIKE  algorithm  are  compared 
with  ScaLapack’s  banded  solvers.  We  consider  two  sets  of  experiments: 

1.  speed  improvement  over  ScaLapack  for  systems  with  varying  bandwidths 
on  32  processors. 

2.  speed  improvement  over  ScaLapack  on  a  given  banded  system  as  the  num¬ 
ber  of  processors  is  varied  from  32  to  512. 

All  the  tests  are  performed  on  the  DataStar  IBM-SP  power4  platform.  This 
platform  has  176  (8-way)  P655+  nodes,  with  16  GB  of  memory  and  1.5GHz 
CPU  speed.  Also,  the  use  of  each  8-way  node  is  exclusive  allowing  only  one  user 
at  the  node  at  any  time. 


1.6.1  SPIKE:  performance  on  32  processors 

We  consider  a  system  of  order  n  =  480,000  that  is  dense  within  the  band, 
with  one  RHS  (s  =  1).  Three  main  tests  with  different  SPIKE  algorithms, 
have  been  conducted  and  the  results  are  listed  in  Table  (1.1).  Tests  1,  2a 
and  2b,  are  concerned  with  non-diagonally  dominant  systems,  while  Tests  3a 
and  3b,  deal  with  diagonally  dominant  ones  as  the  bandwidth  (maximum  num¬ 
ber  of  nonzero  elements  per  row)  varies  from  81  to  641.  These  systems  are 
solved  using  the  SPIKE  schemes  presented,  respectively,  in  sections  1.4  and 
1.5.  For  each  test,  we  show  the  speed  improvement  of  the  SPIKE  algorithm 
over  ScaLapack,  A  =  Tsca./Tspike,  for  the  factorization,  the  solver,  as  well  as 
the  total  time.  The  ScaLapack  routines  used  for  the  factorization  and  solve 
stages,  are  PDGBTRF  and  PDGBTRS  for  non-diagonally  dominant  systems, 
and  PDDBTRF  and  PDDBTRS  for  the  diagonally  dominant  one. 
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Table  1.1:  List  of  various  tests  associated  with  various  options  for  the  SPIKE 
algorithm. 


Diagonally 

Dominant 

Factorization 

step 

Solve 

step 

Outer- 

iterations 

Test  1 

No 

LU  with  pivoting 

Recursive 

No 

Test  2a 

No 

LU  without  pivoting  +  boost. 

Recursive 

No 

Test  2b 

No 

LU  without  pivoting  +  boost. 

Recursive 

Yes 

Test  3a 

Yes 

LU-UL  without  pivoting 

Truncated 

No 

Test  3b 

Yes 

LU  without  pivoting  +  approx. 

Truncated 

Yes 

Test  1 

In  this  experiment  for  non-diagonally  dominant  systems,  SPIKE  is  performed 
using  the  Recursive  algorithm  to  solve  the  reduced  system.  The  LU-factorization 
of  the  diagonal  blocks  of  the  original  matrix,  is  implemented  with  partial  piv¬ 
oting  using  the  LAPACK  routine:  DGBTRF.  SPIKE  is  thus  used  as  a  direct 
solver.  Here,  we  have  assumed  that  each  diagonal  block  is  nonsingular.  The 
speed  improvement  factor,  A,  presented  in  Fig. 1.3  shows  that  SPIKE  is,  on  the 
average  twice  as  fast  as  ScaLapack  for  the  factorization  and  solve  steps. 

Test  2a  and  2b 

In  this  set  of  numerical  experiments,  we  assume  that  we  have  no  knowledge 
regarding  the  non-singularity  of  the  diagonal  blocks.  Thus,  avoiding  pivoting  in 
the  LU-factorization  of  the  diagonal  blocks,  together  with  adopting  a  diagonal 
boosting  strategy,  we  obtain  LU  factorizations  of  slightly  perturbed  block  diag¬ 
onal  matrices.  The  Recursive  SPIKE  scheme  is  thus  used  as  a  preconditioner 
and  outer-iterations  are  needed  to  assure  convergence  to  the  true  solution,  i.e., 
assuring  that  the  relative  residual: 


AX-FHoo 


reaches  a  given  tolerance.  If  diagonal  boosting  is  performed  during  the  fac¬ 
torization  stage  (i.e.,  a  “zero-pivot”  is  detected),  the  solve  part  of  the  SPIKE 
algorithm  is  automatically  used  inside  an  iterative  scheme.  In  our  current  imple¬ 
mentation,  we  can  use  either  the  BiCGstab  iterative  scheme,  or  regular  iterative 
refinement.  When  an  outer  iteration  is  activated,  we  use  the  stopping  criterion, 
r  <  1CU8.  If  no  diagonal  boosting  is  performed  during  the  factorization  stage, 
no  outer  iteration  is  needed  and  SPIKE  is  used  as  a  direct  solver.  Throughout, 
our  experiments  yielded  relative  residuals  much  smaller  than  10-8. 

Figure  1.4  shows  performance  results  for  the  two  following  cases:  (i)  Test 
2a  -  no  diagonal  boosting,  no  outer-iteration  needed,  (ii)  Test  2b  -  diagonal 
boosting,  outer-iterations  needed. 

For  the  factorization  stage  of  both  Test  2a  and  Test  2b,  the  speed  improve¬ 
ment  A  over  ScaLapack,  starts  at  2.1  for  a  bandwidth  of  b  =  81,  increasing 
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Figure  1.3:  Speed  improvement  for  Test  1  of  Recursive  SPIKE,  with  partial 
pivoting,  over  ScaLapack  for  non-diagonally  dominant  systems. 


steadily  as  the  bandwidth  increases  until  it  reaches  11.5  for  b  =  641.  This  rep¬ 
resents  a  significant  speed  improvement  for  the  SPIKE  algorithm  than  those 
presented  in  Fig. 1.3. 

For  the  solve  stage,  Test  2a  shows  better  results  than  Test  1,  with  A  reaching 
3.5,  while  in  Test  2b  A  averages  only  1.3.  The  results  are  shown  after  one-half 
BiCGstab  iteration  (one  or  two  iterative  refinements  are  also  enough  to  get  the 
desired  convergence) .  Although  the  solve  stage  is  slower  for  this  case,  due  to  the 
time  taken  by  the  matrix-vector  multiplications  in  the  outer  iterative  scheme, 
overall  SPIKE  is  still  faster  than  ScaLapack  by  factors  ranging  from  2.0  to  11.0. 

Test  3a  and  3b 

These  experiments  are  concerned  with  diagonally  dominant  systems  solved  using 
the  Truncated  SPIKE  algorithm.  In  Test  3a,  SPIKE  is  used  as  a  direct  solver 
using  the  LU-UL  factorization  strategy.  In  Test  3b,  only  LU  is  performed  on 
the  diagonal  blocks  and  the  top  blocks  Wj  are  approximated,  thus  SPIKE  is 
used  as  a  preconditioner  for  a  suitable  outer  iteration.  Figure  1.5  shows  the 
speed  improvements  over  ScaLapack  obtained  using  these  two  schemes.  For  the 
factorization  stage  of  Test  3a,  A  increases  with  increasing  bandwidth  from  1.3 
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Figure  1.4:  Speed  improvement  over  ScaLapack  for  Recursive  SPIKE  without 
pivoting  for  non-diagonally  dominant  matrices.  No  “ zero-pivot ”  is  detected  in 
Test  2a,  while  in  Test  2b  outer-iterations  are  needed  after  diagonal  boosting.  In 
Test  2b,  one  half  iteration  of  BiCGstab  is  necessary  to  satisfy  the  convergence 
criterion:  r  <  ICO8 


for  b  =  81  to  5.3  for  b  =  641.  In  Test  3b,  the  corresponding  speed  improvement 
roughly  doubles  indicating  that  the  time  taken  to  approximate  the  blocks  Wj 
is  negligible  compared  to  the  the  time  taken  by  the  UL  factorization.  For  the 
solve  stage,  the  SPIKE  and  ScaLapack  schemes  are  equivalent.  For  Test  3a, 
A  =  1,  and  for  Test  3b,  A  drops  to  roughly  0.2.  This  is  due  to  the  matrix- vector 
multiplications  required  by  these  1  to  3  BiCGstab  outer  iterations  needed  for 
convergence.  Overall,  however,  speed  improvement  over  ScaLapack  based  on 
the  total  time  is  still  significant  as  the  bandwidth  increases.  We  note  that  the 
version  of  SPIKE  used  in  Test  3b  yields  higher  speed  improvement  and  thus 
could  be  of  value  in  applications  where  the  factorization  and  solve  stages  are 
not  decoupled. 

Additional  remarks 

Table  1.2  lists  the  different  tests  summarizing  the  time  taken  by  both  the  fac¬ 
torization  and  solve  stages  of  each  algorithm  for  handling  a  given  system  with 
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Figure  1.5:  Speed  improvement  over  ScaLapack  for  the  Truncated  SPIKE  al¬ 
gorithm  for  diagonally  dominant  systems.  Test  3a  uses  the  LU-UL  strategy, 
while  the  Test  3b  requires  outer  iterations  because  of  the  approximation  used  to 
compute  Wj  in  the  factorization  stage. 


bandwidth  b  =  401.  For  the  factorization  stage,  one  can  see  that  Test  1  con¬ 
cerned  the  most  expensive  SPIKE  scheme  -  LU-factorization  with  partial  piv¬ 
oting.  Without  pivoting,  but  with  a  diagonal  boosting  strategy,  this  time  drops 
significantly  in  Tests  2a  and  2b.  The  LU-UL  strategy  used  in  Test  3a  reduces 
the  times  consumed  further  by  avoiding  the  generation  of  the  spikes.  This  time 
is  again  cut  in  half  in  Test  3b  by  eliminating  the  need  of  the  UL  factorization. 
For  the  solve  stage,  the  best  time  is  obtained  for  Tests  2a  and  3a  as  no  outer 
iteration  is  involved,  while  outer  iterations  cause  Tests  2b  and  3b  to  be  the  most 
expensive. 

1.6.2  Scalability 

In  this  section,  we  consider  a  system  of  order  n  =  1, 920,  000  with  a  bandwidth 
b  =  401.  Different  versions  of  SPIKE  are  used  for  diagonally  or  non-diagonally 
dominant  systems  on  p  =  32,  64,  128,  256  and  512  processors.  We  note  that  until 
p  =  512,  the  size  of  each  partition  nj  =  n/p  is  much  larger  than  m  =  (b  —  l)/2 
(number  of  columns  in  each  spike) . 
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Table  1.2:  List  of  various  tests  associated  with  various  options  for  the  SPIKE 
algorithm  augmented  with  factorization  and  solve  times  (in  seconds)  obtained 
for  b  =  401.  ScaLapack  times  are  also  given  for  comparisons. 


Diagonally 

Dominant 

Factorization 

step 

Solve 

step 

Outer- 

iterations 

Spike  Time(s) 
Fact.,  Solve 

Seal.  Time(s) 
Fact.,  Solve 

Test  1 

No 

LU  with  pivoting 

Recursive 

No 

9.11, 

0.16 

20.41, 

0.30 

Test  2a 

No 

LU  without  pivoting  +  boost. 

Recursive 

No 

2.56, 

0.10 

20.61 

0.31 

Test  2b 

No 

LU  without  pivoting  +  boost. 

Recursive 

Yes 

2.55, 

0.29 

20.55 

0.32 

Test  3a 

Yes 

LU-UL  without  pivoting 

Truncated 

No 

1.22, 

0.09 

3.92 

0.082 

Test  3b 

Yes 

LU  without  pivoting  +  approx. 

Truncated 

Yes 

0.6, 

0.32 

3.87 

0.082 

Non-diagonally  dominant  systems 

Two  versions  of  SPIKE  are  considered:  the  Recursive  SPIKE  algorithm  with 
pivoting  used  in  Test  1,  and  the  Recursive  SPIKE  without  pivoting  used  in 
Test2b.  Figures  1.6  and  1.7,  respectively  compare  the  times  consumed  for  the 
factorization  and  the  solve  stages.  For  the  factorization  stage,  both  versions 
of  SPIKE  are  faster  than  ScaLapack.  Although  SPIKE  without  pivoting  is 
the  fastest  algorithm,  its  scalability  is  not  as  robust  due  to  the  communication 
costs  on  the  IBM-SP.  However,  as  the  ratio  n3 /m  increases  (up  to  128),  the 
computational  time  tends  to  be  more  dominant  than  that  required  for  inter¬ 
processor  communications  resulting  in  good  scalability  on  the  IBM-SP.  For  the 
solve  stage,  both  versions  of  Spike  perform  better  than  ScaLapack.  In  all  the 
cases,  for  this  computing  platform,  scalability  deteriorates  beyond  128  proces¬ 
sors  as  the  interprocessor  communication  time  required  for  solving  the  reduced 
system  increases. 

Diagonally  dominant  systems 

As  in  Test  3a,  we  consider  the  LU-UL  strategy  for  the  Truncated  SPIKE.  In 
Fig.  1.8,  one  can  see  that  the  factorization  times  of  SPIKE  are  lower  than  those 
of  ScaLapack,  with  reasonable  scalability  until  128  processors.  The  results 
obtained  for  the  solve  stage  in  Fig.  1.9  show  that  the  ScaLapack  and  SPIKE 
times  are  equivalent  until  256  processors  and  the  scaling  is  almost  optimal. 
However,  for  512  processors,  SPIKE  proves  to  be  more  scalable  than  ScaLapack. 
Indeed,  using  the  Truncated  version  of  SPIKE,  the  communication  time  to  solve 
the  reduced  system  is  minimal,  and  the  computational  time  dominates  that 
required  by  interprocessor  communication. 
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Table  1.3:  Test  1-  Simulation  times  and  relative  speed  improvement  (X)  vs 
ScaLapack,  obtained  with  the  Recursive  SPIKE  with  partial  pivoting  for  non- 
diagonally  dominant  systems. 


Bandwidth 

Factorization 
Tsca.(s)  A  TSpike(s) 

Tsca.  (s) 

Solve 

A  T 

Spike  (s) 

Tsca.  (s) 

Total 

A  T 

Spike  (s) 

b— 81 

0.43 

1.1 

0.41 

0.088 

1.6 

0.054 

0.52 

1.1 

0.46 

b=161 

1.64 

1.4 

1.18 

0.121 

1.6 

0.078 

1.76 

1.4 

1.25 

b— 241 

5.22 

2.1 

2.45 

0.19 

1.9 

0.10 

5.42 

2.1 

2.55 

b=321 

8.83 

2.2 

4.71 

0.24 

1.8 

0.13 

9.07 

1.9 

4.84 

b=401 

20.41 

2.2 

9.11 

0.30 

1.9 

0.16 

20.72 

2.2 

9.28 

b— 481 

34.69 

2.2 

15.57 

0.37 

2.0 

0.18 

35.06 

2.2 

15.75 

b— 561 

47.91 

2.2 

22.11 

0.48 

2.2 

0.22 

48.39 

2.2 

22.33 

b— 641 

75.70 

2.6 

29.20 

0.69 

2.8 

0.25 

76.39 

2.6 

29.45 

Table  1.4:  Test  2a-  Simulation  times  and  relative  speed  improvement  (X)  w.r.t. 
ScaLapack,  obtained  with  the  Recursive  SPIKE  without  pivoting  where  no-outer 
iteration  is  needed,  for  non- diagonally  dominant  systems. 


Bandwidth 

Factorization 
Tsca.(s)  A  TSpike(s) 

Tsca.  (s) 

Solve 

A  T 

Spike  (s) 

Tsca.  (s) 

Total 

A 

^Spike  (s) 

cr 

II 

oo 

0.49 

2.4 

0.21 

0.090 

4.1 

0.022 

0.58 

2.5 

0.23 

b=161 

1.63 

3.1 

0.53 

0.130 

2.9 

0.044 

1.75 

3.1 

0.57 

b— 241 

5.24 

5.1 

1.03 

0.20 

3.1 

0.064 

5.44 

5.0 

1.10 

b=321 

8.83 

5.3 

1.65 

0.25 

3.2 

0.078 

9.08 

5.2 

1.73 

b=401 

20.61 

8.1 

2.56 

0.31 

3.1 

0.099 

20.61 

7.9 

2.66 

b=481 

34.75 

9.5 

3.68 

0.37 

3.1 

0.12 

35.12 

9.3 

3.79 

b=561 

47.99 

9.5 

5.05 

0.48 

3.6 

0.14 

48.47 

9.3 

5.19 

b— 641 

75.69 

11.5 

6.56 

0.66 

3.9 

0.17 

76.36 

11.3 

6.74 
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Table  1.5:  Test  2b-  Simulation  times  and  relative  speed  improvement  (X)  w.r.t. 
ScaLapack,  obtained  with  the  Recursive  SPIKE  without  pivoting  where  outer 
iterations  are  needed ,  for  non-diagonally  dominant  systems. 


Bandwidth 

Factorization 
Tsca.(s)  A  TSpike(s) 

Tsca.  (s) 

Solve 

A  T 

Spike  (s) 

Tsca.  (s) 

Total 

A 

Tspike  (®) 

b=81 

0.44 

2.1 

0.21 

0.090 

1.5 

0.060 

0.59 

2.0 

0.28 

b— 161 

1.68 

3.2 

0.53 

0.14 

1.2 

0.12 

1.83 

2.8 

0.65 

b— 241 

5.26 

5.1 

1.04 

0.20 

1.1 

0.17 

5.46 

4.5 

1.21 

b=321 

8.93 

5.4 

1.66 

0.27 

1.2 

0.23 

9.20 

4.9 

1.89 

b=401 

20.55 

8.1 

2.55 

0.32 

1.1 

0.29 

20.87 

7.4 

2.84 

b— 481 

34.27 

9.4 

3.66 

0.38 

1.1 

0.34 

34.64 

8.7 

4.00 

b— 561 

47.85 

9.5 

5.04 

0.47 

1.2 

0.39 

48.33 

8.9 

5.44 

b— 641 

75.84 

11.5 

6.58 

0.72 

1.5 

0.47 

76.56 

10.9 

7.05 

Table  1.6:  Test  3a-  Simulation  times  and  relative  speed  improvement  (X)  w.r.t. 
ScaLapack,  obtained  with  the  Truncated  SPIKE  with  LU-UL  strategy,  for  diag¬ 
onally  dominant  systems. 


Bandwidth 

Factorization 
Tsca.(s)  A  TSpike(s) 

Tsca.  (s) 

Solve 

A  T 

Spike  (s) 

Tsca.  (s) 

Total 

A  T 

Spike  (s) 

b=81 

0.20 

1.3 

0.16 

0.022 

1.0 

0.022 

0.22 

1.2 

0.18 

b=161 

0.63 

1.8 

0.35 

0.041 

0.9 

0.044 
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Figure  1.6:  Factorization  times  (on  log  scale)  taken  by  ScaLapack,  Recursive 
SPIKE  with  pivoting,  and  Recursive  SPIKE  without  pivoting  for  non-diagonally 
dominant  systems. 


Table  1.7:  Test  3b-  Simulation  times  and  relative  speed  improvement  (X)  w.r.t. 
ScaLapack,  obtained  with  the  Truncated  SPIKE  with  LU  factorization  and  an 
approximation  strategy,  for  diagonally  dominant  systems. 


Bandwidth 

Factorization 

Tsca.  (s)  A  TSpike(s) 

Tsca.  (s) 

Solve 

A  T 

Spike  (s) 

Tsca.  (s) 

Total 

A  T 

Spike  (s) 

b— 81 

0.20 

2.7 
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0.3 
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1.6 
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0.68 
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Figure  1.7:  Solve  times  (on  log  scale)  taken  by  ScaLapack,  Recursive  SPIKE  with 
pivoting,  and  Recursive  SPIKE  without  pivoting  for  non- diagonally  dominant 
systems. 
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Figure  1.8:  Time  for  the  factorization  stage  (on  log  scale)  for  ScaLapack,  and 
the  Truncated  SPIKE  using  LU-UL  strategy,  for  diagonally  dominant  systems. 
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Figure  1.9:  Time  for  the  solve  stage  (on  log  scale)  for  ScaLapack,  and  the  Trun 
cated  SPIKE  using  LU-UL  strategy,  for  diagonally  dominant  systems. 


Chapter  2 


A  Parallel  Framework  for 
Solving  Banded  Linear 
Systems  that  are  Sparse 
Within  the  Band 

2.1  Introduction 

A  large  number  of  applications  produce  narrow  banded  systems  (with  or  without 
reordering)  that  are  sparse  within  the  band.  For  these  systems  in  which  the 
bandwidth  is  not  sufficiently  narrow  so  as  to  be  treated  as  dense  within  the 
band,  we  need  to  develop  an  alternate  version  of  SPIKE.  A  SPIKE  “on-the- 
fly”  scheme  is  proposed  that  does  not  require  the  generation  of  the  spikes,  and 
for  which  the  reduced  system  is  not  formed  explicitly.  Rather,  the  reduced 
system  is  solved  via  a  matrix-free  iterative  scheme  (such  as  BiCGstab)  in  which 
the  matrix- vector  multiplications  are  performed  “on-the-fly” .  This  scheme  is 
ideally  suited  for  handling  banded  systems  with  large  sparse  bands.  In  addition, 
a  multi-level  parallel  version  of  SPIKE  is  proposed  to  take  advantage  of  parallel 
sparse  direct  solvers  such  as  SupeLU  [52],  MUMP’S  [56]  or  PARDISO  [58]. 

Solving  the  reduced  system  implicitly 

The  top  and  bottom  parts  of  the  spikes  Vj  and  Wj  that  constitute  the  reduced 
system,  can  be  expressed  as  follows: 
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2.2  A  sparse  SPIKE  system  solver  using  multi¬ 
level  parallelism 

For  banded  systems  that  are  sparse  within  the  band,  existing  parallel  sparse 
direct  solvers  such  as  SupeLU,  MUMPS,  or  PARDISO,  may  not  be  successful 
in  minimizing  the  fill-in  in  the  factorization  stage  even  after  reordering.  This  is 
particularly  true  if  the  matrix  is  narrow  banded,  i.e.,  the  ratio  of  the  bandwidth 
to  the  system  size  is  small  (ranging  from  0.001  to  0.01).  In  the  SPIKE  scheme, 
the  partitioning  stage  will  create  diagonal  blocks  with  larger  bandwidth-size 
ratios  that  allow  efficient  use  of  these  sparse  direct  solvers  on  each  partition.  To 
minimize  the  time  taken  by  the  solve  stage  of  the  SPIKE  “on-the-fly”  scheme, 
we  employ  a  two-level  parallel  scheme  for  obtaining  a  reduced  system  of  much 
smaller  size,  and  thus  easier  to  solve  by  an  iterative  scheme.  For  instance  in  Fig. 
2.1,  we  consider  four  partitions  and  make  use  of  the  PARDISO  shared  memory 
direct  solver  in  each  partition  (i.e.,  node),  while  the  SPIKE  distributed  memory 
scheme  is  used  across  partitions  (i.e.,  nodes). 


SPIKE 


r 


.a. 


Node  1  Node  2  Node  3  Node  4 


A 


Pardiso  Pardiso  Pardiso  Pardiso 

Figure  2.1:  SPIKE  with  two-level  parallelism  using  PARDISO  within  each  node. 
The  machine  used  for  the  illustration,  has  4  dual-processor  nodes. 
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Table  2.1:  Test  (a).  SPIKE  runs  using  PARDISO  on  2  and  4  nodes  and  Trun¬ 
cated  SPIKE  on  8  processors,  with  Reordering,  Factorization,  Solve,  and  Total 
times  in  seconds. 


Test  (a) 

Reord. 

Fact. 

Solve 

Total 

Residual 

SPIKE  2-nodes 

9.48 

1.55 

4.54 

15.57 

10“' 

SPIKE  4-nodes 

5.68 

0.77 

2.3 

8.76 

10“' 

Trunc.  SPIKE  (8) 

0 

0.84 

0.14 

0.98 

10“14 

2.3  Numerical  experiments 

We  consider  two  test  cases  for  non-diagonally  dominant  symmetric  positive  def¬ 
inite  systems  with  different  bandwidths.  The  systems  are  obtained  via  a  3-D 
finite  element  discretization  of  the  Poisson  equation  [59] : 

Test  (a)  size  n  =  432,000,  bandwidth  h  =  177,  number  of  non-zero  elements 
nnz  =  7,955, 116,  with  10.4%  sparsity  density  within  the  band. 

Test  (b)  n  =  471,800,  b  =  1455,  nnz  =  9,499,744,  1.4%  sparsity  density 
within  the  band. 

Our  numerical  experiments  are  performed  on  an  Intel  dual-Xeon  3.2  Ghz 
Linux  cluster  of  four  nodes  with  Infiniband  connection.  Observing  that  the 
PARDISO  package  is  included  in  the  Intel-MKL  library,  Table  2.1  shows  the 
time  consumed  in  the  various  stages  of  SPIKE  with  PARDISO  on  2  and  4 
nodes,  and  the  Truncated  SPIKE  on  8  processors  assuming  that  the  matrix  is 
dense  within  the  band. 


Table  2.2:  Test  ( b ).  SPIKE  runs  using  PARDISO  on  2  and  4  nodes,  with 
Reordering,  Factorization,  Solve,  and  Total  times  in  seconds. 


Test  (b) 

Reord. 

Solve 

|  Total 

Residual 

SPIKE  2-nodes 

18.69 

10“' 

SPIKE  4-nodes 

8.41 

Eiigl 

BUI 

10“' 

One  observes  that  for  such  narrow  banded  systems  as  proposed  in  test  (a), 
SPIKE  becomes  much  more  effective  (speed  and  accuracy)  if  we  consider  the 
matrix  as  dense  within  the  band.  Also,  as  stated  previously,  we  demonstrate 
that  the  Truncated  SPIKE  scheme  converges  nicely  even  if  the  system  in  test  (a) 
is  not  diagonally  dominant.  The  SPIKE-PARDISO  “on-the-fly”  scheme  shows 
good  scalability  for  the  reordering,  factorization  and  solve  stages,  where  the 
outer  iterations  are  performed  using  BiCGstab  with  a  relative  residual  stopping 
criterion  of  10~7.  In  contrast,  if  one  uses  PARDISO  alone  on  one  and  two 
processors  on  the  same  system,  the  reordering  times  remain  almost  unchanged 
at  18.5s,  while  the  factorization  times  range  from  4.1s  to  3.1s.  Also,  the  times 
consumed  by  the  distributed  memory  solvers  MUMPS  and  SuperLU  on  1,  2,  4 
and  8  processors  associated  with  1,  2,  4,  and  4  nodes,  respectively,  are  given  by: 
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•  reordering  stage:  ~  17s  for  MUMPS  (for  all  parallel  configurations),  and 
from  10.5s  to  7.5s  for  SuperLU, 

•  factorization  stage:  6.3s,  4.2s,  3s  and  20s  for  MUMPS  with  excessive  mem¬ 
ory  swaps  on  8  processors,  and  from  17s  to  180s  for  SuperLU  due  to  too 
much  fill-in  and  memory  swaps  on  our  machine  (with  4GB  of  memory  per 
node) . 

In  contrast  to  test  (a),  the  system  in  test  (b)  cannot  be  treated  as  dense 
within  the  band.  Table  2.2  shows  the  results  realized  by  SPIKE  with  PARDISO 
on  2  and  4  nodes.  SPIKE-PARDISO  “on-the-fly”  also  shows  very  good  scalabil¬ 
ity,  with  more  than  a  factor  two  for  the  reordering  stage.  Due  to  excessive  fill-in 
in  the  factorization  stage,  both  MUMPS  and  SuperLU  require  more  memory 
than  available  in  our  platforms. 
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Chapter  3 


Weighted  Matrix  Ordering 
and  Banded 
Preconditioners  for 
Non-symmetric  Linear 
System  Solvers 

3.1  Introduction 

Solving  sparse  linear  systems  is  the  most  time  consuming  part  of  many  applica¬ 
tions  in  computational  science  and  engineering.  While  direct  methods  provide 
robust  solvers  (i.  e.,  they  are  guaranteed  to  find  an  existing  solution  in  a  pre¬ 
cisely  characterizable  amount  of  time),  their  application  is  feasible  only  if  the 
system  is  not  very  large.  Iterative  methods,  on  the  other  hand,  take  advantage 
of  system  properties  to  provide  good  approximations  to  the  solution  in  much 
shorter  times.  Furthermore,  while  iterative  solvers  are  more  scalable  on  paral¬ 
lel  architectures,  they  are  not  as  robust  as  direct  solvers.  In  iterative  solvers, 
preconditioners  are  often  used  to  improve  the  convergence  properties.  Unlike 
approximate  factorization-based  preconditioners,  banded  preconditioners  have 
excellent  parallel  scalability  [61]. 

In  order  to  apply  a  banded  solver,  it  is  necessary  to  reduce  the  bandwidth  of 
the  system.  While  traditional  bandwidth  reduction  techniques  such  as  Reverse 
Cuthill-McKee  [20]  and  Sloan  [62],  are  successful  for  classes  of  problems,  their 
application  for  extraction  of  effective  banded  preconditioners  is  limited.  This  is 
because,  the  performance  of  these  algorithms  depends  on  the  intrinsic  sparsity 
pattern  of  the  matrix  under  consideration,  i.e.,  they  are  only  applicable  if  the 
matrix  can  be  reordered  into  a  narrow  banded  system,  so  that  a  direct  solver  can 
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be  used.  Parallel  banded  solvers,  such  as  SPIKE  [67],  also  require  a  reasonably 
small  bandwidth,  since  the  volume  of  communication  is  directly  dependent  on 
the  bandwidth.  Clearly,  this  may  not  be  the  case  for  most  linear  systems. 
Symmetric  reordering  alone  does  not  guarantee  a  nonsingular  preconditioner. 
In  fact,  in  many  applications  such  as  circuit  or  chemical  process  simulation,  the 
main  diagonal  is  likely  to  contain  many  zeros  raising  the  likelihood  of  a  singular 
banded  preconditioner. 

To  overcome  these  problems,  we  developed  a  bandwidth  reduction  technique 
aimed  at  encapsulating  as  many  of  the  heaviest  elements  of  the  matrix  into  a 
central  narrow  band.  This  makes  it  possible  to  extract  a  narrow-banded  pre¬ 
conditioner  by  dropping  the  entries  that  are  outside  the  band.  To  solve  the 
weighted  bandwidth  reduction  problem,  we  use  a  weighted  spectral  reordering 
(WSO)  technique  that  provides  an  optimal  solution  to  a  continuous  relaxation 
of  the  corresponding  optimization  problem.  This  technique  is  a  generalization  of 
spectral  reordering,  which  has  also  been  effectively  used  for  reducing  the  band¬ 
width  and  envelope  of  sparse  matrices  [4] .  To  alleviate  the  problems  associated 
with  symmetric  reordering,  we  couple  this  method  with  non-symmetric  reorder¬ 
ing  techniques,  such  as  the  maximum  traversal  algorithm  [25],  to  make  the  main 
diagonal  zero  free  and  place  the  largest  entries  on  the  main  diagonal  [26] . 

The  resulting  algorithm  can  be  summarized  as  follows:  (i)  use  non-symmetric 
reordering  to  make  the  main  diagonal  free  of  zeros,  (ii)  use  weighted  spectral 
reordering  to  move  larger  elements  closer  to  the  diagonal,  and  (iii)  extract  a 
narrow  central  band  to  use  as  a  preconditioner  for  a  Krylov  subspace  method. 
Our  results  show  that  this  yields  a  fast,  highly  parallelizable  preconditioner 
with  excellent  convergence  characteristics,  particularly  for  non-symmetric  ma¬ 
trices  with  significantly  variable  entries  in  terms  of  their  magnitude.  We  also 
demonstrate  that  WSO  is  more  robust  than  LU-type  preconditioners. 


3.2  Background  and  Related  Work 

Solving  sparse  linear  systems, 

Ax  =  f,  (3.1) 

is  the  most  time  consuming  part  of  many  applications  in  scientific  computing. 
Direct  solvers,  although  very  robust,  are  feasible  if  the  system  is  not  very  large. 
Iterative  methods,  on  the  other  hand,  are  more  suitable  for  very  large  sparse 
systems,  but  lack  robustness. 

Preconditioning  aims  to  improve  the  robustness  of  iterative  methods  by 
transforming  the  system  into 

hr1  Ax  =  M~lf ,  or  AM~1(Mx)  =  /.  (3.2) 

Here,  the  preconditioner  M  is  designed  in  such  a  way  that  the  coefficient  matrix 
AM -1  or  A1~1A  has  more  desirable  properties.  In  this  report,  without  loss  of 
generality,  we  use  the  term  preconditioning  to  refer  to  left  preconditioning.  Here, 
M-1  is  an  approximation  to  in  the  sense  that  the  eigenvalues  of  M~XA  are 
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clustered  around  1.  In  addition,  the  action  of  ill'1  on  a  vector  should  be  cheap 
and  suitable  for  parallel  computing. 

An  important  class  of  preconditioners  is  based  on  approximate  LU- 
factorization.  These  ILU  preconditioners  are  well  studied  and  used  successfully 
to  precondition  various  classes  of  linear  systems  [17,  35,  82].  There  also  exist 
dedicated  software  packages  that  are  commonly  used,  including  the  ILUT  pack¬ 
age  [63].  More  recently,  Benzi  et  al.  investigated  the  role  of  reordering  on  the 
performance  of  ILU  preconditioners  [6,  7].  Banded  preconditioners  provide  an 
effective  alternative  to  ILU-type  preconditioners  because  of  their  suitability  for 
implementation  on  parallel  architectures. 

3.2.1  Banded  preconditioners 

A  preconditioner  M  is  banded  if 

kl  >  i —  j  and  ku  >  j  —  i  =>  mij  yf  0,  (3-3) 

where  kl  and  ku  are  the  lower  and  upper  bandwidths,  respectively.  The  half¬ 
bandwidth,  k,  is  defined  as  the  maximum  of  kl  and  ku,  and  the  bandwidth  of 
the  matrix  is  equal  to  kl  +  ku  + 1.  Banded  preconditioners  are  generalizations  of 
diagonal  and  tridiagonal  preconditioners  and  are  shown  to  be  effective  in  solving 
various  problems  [57,  61,  77].  However,  banded  preconditioners  alone  are  not 
suitable  as  black-box  preconditioners,  since  a  dominant  band  does  not  always 
exist  without  preprocessing.  Therefore,  it  is  necessary  to  reorder  the  rows  and 
columns  of  a  matrix  to  pack  them  in  a  narrow  central  band  that  can  be  used  as  a 
preconditioner.  In  general,  reordering  algorithms  solve  an  optimization  problem 
that  can  be  defined  as  one  of  finding  a  permutation  of  the  rows  and  columns 
that  minimizes  the  bandwidth.  Our  approach  is  based  on  a  generalization  of 
this  approach  in  that  we  aim  to  minimize  the  bandwidth  that  encapsulates  a 
portion  of  the  non-zeros  in  the  matrix,  rather  than  all  non-zeros. 


3.3  Weighted  Bandwidth  Reduction 

3.3.1  Non-symmetric  reordering 

We  first  apply  a  non-symmetric  row  permutation  of  (3.1)  as  follows: 

QAx  =  Qf.  (3.4) 

Here,  Q  is  a  permutation  matrix.  We  consider  two  cases:  (i)  find  a  Q  such  that 
the  main  diagonal  contains  as  many  non-zeros  as  possible,  (ii)  find  a  Q  such 
that  the  product  of  the  absolute  values  of  the  diagonal  entries  is  maximized  [26] . 
Optionally,  the  second  approach  can  provide  scaling  factors  so  that  the  absolute 
values  of  the  diagonal  entries  are  equal  to  one  and  all  other  elements  are  less 
than  or  equal  to  one.  Scaling  is  applied  to  the  original  coefficient  matrix  as 
follows: 

(QD2AD1)(D^1x)  =  (QD2f).  (3.5) 
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The  first  algorithm  is  known  as  maximum  traversal  search.  Both  algorithms 
are  implemented  in  the  MC64[25l  subroutine  of  the  Harwell  Subroutine  Library 
HSL[45], 

3.3.2  Symmetric  reordering 

After  the  above  non-symmetric  reordering  and  optional  scaling,  we  apply  a 
symmetric  reordering  as  follows: 

(PQD2AD1Pt)(PD^1  x)  =  ( PQD2f ).  (3.6) 

where  P  is  a  permutation  matrix.  We  use  Weighted  Spectral  Reordering  (WSO) 
to  find  a  permutation  that  minimizes  the  bandwidth  encapsulating  a  specified 
fraction  of  the  total  magnitude  of  non-zeros  in  the  matrix.  “WSO”  is  described 
in  detail  in  the  next  section. 

Weighted  spectral  reordering  (WSO) 

Traditional  reordering  algorithms,  such  as  Cuthill-McKee  [20]  and  spectral  re¬ 
ordering  [4],  aim  to  minimize  the  bandwidth  of  a  matrix.  The  half-bandwidth 
of  a  matrix  A  is  defined  as 

BW(A)=  max  \i  —  j\,  (3.7) 

i,r-A(i,j)>  o 

i.  e.,  the  maximum  distance  of  a  nonzero  entry  from  the  main  diagonal.  These 
methods  aim  to  pack  the  non-zeros  of  a  sparse  matrix  into  a  narrow  band  around 
the  diagonal. 

These  methods  do  not  take  into  account  the  magnitude  of  nonzero  entries. 
However,  significantly  degrade  the  performance  ( e.g .,  convergence  rate)  of  the 
algorithm  at  hand,  particularly  for  matrices  whose  entries  vary  significantly  in 
magnitude. 

To  address  this  problem,  we  introduce  the  weighted  bandwidth  reduction 
problem,  which  aims  at  packing  the  heaviest  elements  of  the  matrix  in  a  narrow 
band.  This  results  in  a  less  constrained  formulation  of  the  bandwidth  reduction 
problem.  In  other  words,  the  goal  of  weighted  bandwidth  reduction  is  to  obtain 
a  preconditioner  with  minimal  bandwidth,  which  adequately  approximates  the 
matrix  at  hand,  rather  than  minimizing  the  bandwidth  of  the  entire  matrix. 
More  precisely,  for  matrix  A,  and  specified  error  bound  e,  we  define  the  weighted 
bandwidth  reduction  problem  as  one  of  finding  a  matrix  M  with  minimum 
bandwidth,  such  that 


Tnj  \A(iJ)\  i,j 
Ei,j  \ 


(3.8) 


The  idea  behind  this  formulation  is  that,  if  a  significant  part  of  the  matrix  is 
packed  in  a  narrow  band,  then  the  rest  of  the  non-zeros  can  be  dropped  to  obtain 
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an  efficient  preconditioner,  while  keeping  the  effect  of  the  resulting  perturbation 
within  a  specified  bound. 

In  order  to  find  a  heuristic  solution  to  the  weighted  bandwidth  reduction 
problem,  we  use  a  generalization  of  spectral  reordering.  Spectral  reordering 
is  a  linear  algebraic  optimization  technique  that  is  commonly  used  to  obtain 
approximate  solutions  to  various  intractable  graph  optimization  problems  [44]. 
It  is  also  successfully  applied  to  the  bandwidth  and  envelope  reduction  problems 
for  sparse  matrices  [4].  The  core  idea  of  spectral  reordering  is  to  compute  a 
vector  x  that  minimizes 

°a{x)=  (*(*)  -  x(j)fy  (3-9) 

i,r-A(i,j)>0 

where  |  |rc| 1 2  =  1-  Here,  it  is  assumed  that  the  matrix  A  is  symmetric.  The 
vector  x  that  minimizes  a  a  (x)  provides  a  mapping  of  the  rows  (and  columns) 
of  the  matrix  A  to  a  one-dimensional  Euclidean  space,  such  that  pairs  of  rows 
that  correspond  to  non-zeros  are  located  as  close  as  possible  to  each  other. 
Consequently,  the  ordering  of  the  entries  of  the  vector  x  provides  an  ordering 
of  the  matrix  that  significantly  reduces  the  bandwidth. 

Fiedler  [29]  showed  that  the  optimal  solution  to  this  problem  is  given  by  the 
eigenvector  corresponding  to  the  second  smallest  eigenvalue  of  the  Laplacian 
matrix,  where  the  Laplacian  L  of  a  matrix  A  is  defined  as 

=  if  i  j  A  A(i,j)  >  0  ,  , 

L(i,i)  =  \{j:A(i,j)>0}\. 

Note  that  the  matrix  L  is  positive  semi-definite,  so  the  smallest  eigenvalue  of 
this  matrix  is  equal  to  zero.  The  eigenvector  x  that  minimizes  <7A{x )  =  xT Lx 
is  known  as  the  Fiedler  vector.  The  Fiedler  vector  of  a  sparse  matrix  can  be 
computed  efficiently  using  iterative  algorithms  [49] . 

While  spectral  reordering  is  shown  to  be  effective  in  bandwidth  reduction, 
the  classical  spectral  approach  described  above  ignores  the  magnitude  of  non¬ 
zeros  in  the  matrix.  Therefore,  it  is  not  directly  applicable  to  the  weighted 
bandwidth  reduction  problem.  However,  Fiedler’s  result  can  be  generalized  for 
the  weighted  case  [14].  More  precisely,  the  eigenvector  x  that  corresponds  to 
the  second  smallest  eigenvalue  of  the  weighted  Laplacian  L  minimizes 

aA(x)  =  xTLx  =  I ^(bj)IO(d)  -  XU))2,  (3-11) 

hj 


where  L  is  defined  as 


L{i,j)  =  -\A(i,j)\  if  i^j 

L(M)  =  Ej  \A(iJ)\- 


(3.12) 


We  now  show  how  weighted  spectral  reordering  can  be  used  to  obtain  a 
continuous  approximation  to  the  weighted  bandwidth  reduction  problem.  For 
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this  purpose,  we  first  define  the  relative  band- weight  of  a  given  band  in  the 
matrix  as  follows: 


wk{A) 


Hi.j:,  j<k  \Mhj)\ 
£i,j  \A(i,j)\ 


(3.13) 


In  other  words,  the  band-weight  of  a  matrix  A ,  with  respect  to  an  integer  k,  is 
equal  to  the  fraction  of  the  total  magnitude  of  entries  that  are  encapsulated  in 
a  band  of  half- width  k. 

For  a  given  a,  0  <  a  <  1,  we  define  a-bandwidth  as  the  smallest  half¬ 
bandwidth  that  encapsulates  a  fraction  a  of  the  total  matrix  weight,  i.e., 


BWa(A)  =  min  k.  (3.14) 

k:Wk  (-A)>a: 

Observe  that  a-bandwidth  is  a  generalization  of  half-bandwidth,  i.e.,  when  a  = 
1,  the  a-bandwidth  is  equal  to  the  half-bandwidth  of  the  matrix. 

Now,  for  a  given  vector  x  G  Rn,  define  an  injective  permutation  function 
7r(i)  :  {1,2 ,...,n}  — >  {1,2,  such  that,  for  1  <  i,j  <  n,  x(n(i ))  <  x(7r (j)) 

iff  i  <  j.  Here,  n  denotes  the  number  of  rows  (columns)  in  matrix  A.  Moreover, 
for  fixed  k,  define  the  function  5k(i,  j)  :  {1,2, ...,  n}  x  {1,  2, ...,  n}  — >  {0, 1},  which 
quantizes  the  difference  between  n(i)  and  tt (j)  with  respect  to  k,  i.e.,, 

««)={;  (3,i5) 


Let  A  be  the  matrix  obtained  by  reordering  the  rows  and  columns  of  A 
according  to  7 r,  i.e., 

A(ir(i),ir(j))  =  A(i,j)  for  1  <  i,  j  <  n.  (3.16) 

Then,  5k(i,  j )  =  0  indicates  that  A(i,  j)  is  inside  a  band  of  half-width  k  in  matrix 
A,  while  Sk{i,j )  =  1  indicates  that  it  is  outside  this  band.  Defining 

°k(A)  =  \A(i,j)\Sk(i,j),  (3.17) 

we  obtain 

ak(A)  =  (1  -  wk(A))  Y \A(i,j)\-  (3.18) 

i,j 

Therefore,  for  fixed  a,  the  a-bandwidth  of  matrix  A  is  equal  to  the  smallest  k 
that  satisfies  d^(fc)/  JT  j  \A(i,j)\  <  1  —  a. 

Observe  that  the  problem  of  minimizing  ax  (H)  is  a  continuous  relaxation  to 
the  problem  of  minimizing  ak{A)  for  given  k.  Therefore,  the  Fiedler  vector  of 
weighted  Laplacian  L  provides  a  good  basis  for  reordering  A  to  minimize  ak(A). 
Consequently,  for  fixed  e,  this  vector  provides  a  heuristic  solution  to  the  problem 
of  finding  a  reordered  matrix  A  with  minimum  (1  —  e)-bandwidth.  Once  this 
matrix  is  obtained,  we  compute  the  banded  preconditioner  M  as  follows: 

M  =  { M(i,j )  :  M(i,j)  =  A(i,j)  if  | *  -  j\  <  BW\-e{A),  0  else}.  (3.19) 
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Clearly,  M  satisfies  Equation  3.8  and  it  is  of  minimal  bandwidth. 

Note  that  spectral  reordering  is  defined  specifically  for  symmetric  matrices 
and  the  resulting  permutation  is  symmetric  as  well.  Since  the  main  focus  of 
this  study  is  on  non-symmetric  matrices,  we  generalize  spectral  reordering  to 
non-symmetric  matrices  by  computing  the  Laplacian  matrix  with  respect  to 
|A|  +  \AT\  instead  of  \A\.  Note  that,  this  formulation  results  in  a  symmet¬ 
ric  permutation  for  a  non-symmetric  matrix,  which  may  be  considered  over¬ 
constrained. 

3.3.3  Summary  of  banded  solvers 

On  parallel  platforms,  the  Spike  banded  solver  [10,  15,  23,  51,  67,  71]  is  shown 
to  have  excellent  scalability  [59,  60]  compared  to  ScaLapack  [11],  The  central 
idea  of  SPIKE  is  to  partition  the  matrix  so  that  each  processor  can  work  on  its 
own  part  and  communicates  with  other  processors  only  at  the  end  to  solve  a 
common  reduced  system.  The  size  of  the  reduced  system  is  determined  by  the 
bandwidth  of  the  matrix  and  the  number  of  partitions. 


3.4  Numerical  Experiments 

With  the  objective  of  establishing  superior  runtime  and  convergence  properties 
of  banded  preconditioners  for  a  broad  class  of  problems,  we  perform  detailed 
experimental  measurements. 

3.4.1  Experimental  setup  and  test  problems 

We  obtained  the  test  problems  from  the  University  of  Florida  Sparse  Matrix 
Collection  [21].  We  choose  moderately  large  matrices  with  a  larger  number  of 
non-zeros  (to  ensure  that  the  bandwidth  reduction  problem  is  not  trivial).  We 
did  not  include  those  matrices  that  can  be  reordered  into  a  narrow  band  using 
classical  reordering  schemes  such  as  Reverse  Cuthill-McKee  (RCM).  Further,  in 
order  to  demonstrate  the  effectiveness  of  our  methods,  we  chose  matrices  from 
various  applications  that  are  structurally  un-symmetric  and  have  zeros  on  the 
main  diagonal.  Therefore,  without  any  reordering,  they  are  difficult  to  solve 
using  banded  or  ILU  type  preconditioners.  Description  of  the  test  problems 
and  related  statistics  are  shown  in  Tables  3.1  and  3.2,  respectively. 

In  all  of  the  numerical  experiments,  we  use  the  BiCGStab  [78]  iterative  solver 
with  a  left  preconditioner.  We  terminate  the  iterations  when  |  |r*fe  1 1 oo /[  |r*o  |  |oo  < 
10-5.  The  right  hand  side  is  generated  from  a  solution  vector  of  all  ones  in 
order  to  ensure  that  /  €  span(A). 

Implementation  of  ILU  based  preconditioners  For  each  problem,  we 
apply  ILUT  to  precondition  the  system  as  follows.  For  non-symmetric  permu¬ 
tation,  we  implement  the  most  promising  technique  mentioned  in  [6,  7].  Namely, 
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Table  3.1:  Description  of  Test  Problems 


Number 

Name 

Application 

1 

2D  _540 1 9 _HIGHK 

Semiconductor  Device  Simulation 

2 

APPU 

NASA  App  benchmark 

3 

ASIC_680k 

Circuit  Simulation  Problem 

4 

BUNDLE1 

3D  Computer  Vision 

5 

DC1 

Circuit  Simulation  Problem 

6 

DW8191 

Dielectric  Waveguide 

7 

FEM_3D  .THERMAL  1 

Thermal  Problem 

8 

FINAN512 

Economic  Problem 

9 

FP 

2-D  Fokkcr  Planck  Equations 

10 

H20 

Theoretical/Quantum  Chemistry  Problem 

11 

MSC23052 

Symmetric  Test  Matrix  from  MSC/NASTRAN 

12 

RAEFSKY5 

Landing  Hydrofoil  Airplane  FSE  Model 

Table  3.2:  Properties  of  Test  Properties 


Number 

Name 

Dimension 

Non-zeros 

1 

2D_54019_HIGHK 

54,019 

996,414 

2 

APPU 

14,000 

1,853,104 

3 

ASIC_680k 

680,000 

2,638,997 

4 

BUNDLE  1 

10,581 

770,811 

5 

DC1 

116,835 

766, 396 

6 

DW8191 

8,192 

41,746 

7 

FEM_3D  .THERMAL  1 

17,880 

430,740 

8 

FINAN512 

74, 752 

596,992 

9 

FP 

7,548 

834, 222 

10 

H20 

67, 024 

2,216,736 

11 

MSC23052 

23,052 

1,142,686 

12 

RAEFSKY4 

19,779 

674, 195 
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we  use  the  non-symmetric  reordering  algorithm  available  in  MC64,  with  the  ob¬ 
jective  of  maximizing  the  absolute  value  of  the  product  of  diagonal  entries.  In 
addition,  we  obtain  scaling  factors  {D\  and  D 2).  After  non-symmetric  reorder¬ 
ing  and  scaling,  we  perform  RCM  reordering.  Finally,  we  obtain  the  ILUT 
factorization  using  the  scaled  and  reordered  coefficient  matrix.  We  refer  to  this 
method  as  ILUTI.  For  ILUTI,  we  report  the  MC64  time,  RCM  time,  ILUT  fac¬ 
torization  time,  and  the  solution  time,  which  involves  the  BiCGStab  iterations. 

We  also  use  the  ILUT  preconditioner  without  any  reordering  or  scaling, 
and  we  refer  to  this  approach  as  ILUT.  For  both  ILUTI  and  ILUT,  we  try 
various  maximum  fill-in  ( fillin )  and  drop  (droptol)  tolerance  values.  First,  we 
allow  a  maximum  f  illin  of  10,  000  per  row  for  all  problems,  and  a  droptol  of 
10'1  and  10-3.  Note  that,  even  though  the  maximum  fillin  per  row  is  quite 
generous,  the  actual  fillin,  see  Table  3.4,  due  to  dropping  never  reaches  the 
level  of  the  specified  maximum  f  illin.  In  addition,  as  a  separate  experiment, 
we  allow  fillin  =  k,  where  k  is  the  (1  —  e)-bandwidth  of  WSO,  for  a  specified 
bound  on  error  e.  In  this  case,  we  do  not  drop  any  elements.  By  considering 
various  combinations  of  the  ( f  illin ,  di'optol)  parameter  pair,  one  can  experiment 
to  find  the  optimal  values.  However,  it  turns  out  to  be  quite  expensive  to  form 
the  factorization  each  time  and  the  behavior  of  performance  with  respect  to 
varying  these  parameters  is  sometimes  counter-intuitive.  In  other  words,  more 
fillin  does  not  always  mean  improving  the  convergence  properties  as  previously 
reported  [48,  64]. 

Implementation  of  weighted  bandwidth  reduction  based  precondi¬ 
tioner  For  each  problem,  we  first  reorder  the  system  with  MC64  using  the 
variation  of  the  algorithm  that  maximizes  the  number  of  non-zeros  on  the  main 
diagonal.  Next,  we  reorder  the  resulting  system  with  spectral  reordering  using 
MC73[46],  which  is  available  in  HSL.  We  use  the  default  parameters  of  MC73. 
After  obtaining  the  reordered  system,  we  determine  the  (1— e)-bandwidth,  where 
e  specifies  the  desired  bound  on  the  relative  difference  between  the  precon¬ 
ditioner  and  the  original  matrix.  Note  that  the  computation  of  the  (1  —  e)- 
bandwidth  requires  0{nnz)  +  0(n)  time  and  0(n)  storage.  To  achieve  this,  we 
first  create  a  work  array,  w(l  :  n),  of  dimension  n.  Then,  for  each  nonzero, 
ciij,  we  update  w(abs(i  —  j))  =  w(abs(i  —  j))  +  absiatj).  Finally,  we  compute  a 
cumulative  sum  in  w  starting  from  index  0  until  (1  —  e)-bandwidth  is  reached. 
The  time  for  this  process  is  included  in  the  LAPACK  time.  In  this  set  of  exper¬ 
iments,  we  choose  e  =  1CU4.  In  general,  the  corresponding  (1  —  e)-bandwidth 
can  be  as  large  as  n.  To  prevent  this,  we  place  an  upper  limit  k  <  50  if  the 
matrix  dimension  is  greater  than  10,  000.  In  Table  3.3,  the  achieved  values  of 
the  (1  —  e)-bandwidth  are  given  for  each  of  the  test  problems. 

After  determining  k,  the  WSO  scheme  proceeds  with  the  extraction  of  the 
banded  preconditioner  with  the  bandwidth  2  x  k  +  1.  Again,  this  process  is 
inexpensive  and  its  cost  is  reported  together  with  the  LAPACK  time.  After 
factorization  of  the  preconditioner  via  an  LU-scheme  without  pivoting,  we  use 
it  to  accelerate  the  BiCGStab  iterations. 
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Solution  of  linear  systems  For  both  ILUTI  and  WSO,  we  obtain  a  solution 
to  (3.6)  by  solving  the  following  reordered  and  scaled  (if  any)  system 


Ax  =  f.  (3.20) 

Here,  A  =  PQD2AD\PT ,  x  =  PDf^x,  and  /  =  PQD2f .  The  corresponding 
scaled  residual  is  given  by  f  =  f  —  Ax,  where  x  is  the  computed  solution.  One 
can  recover  the  residual  of  the  original  system  (3.1),  r,  as  follows: 


f  =  PQD2f  -  PQD2AD1PTPDf1x 

(3.21) 

=>f  =  PQD2f  -  PQD2Ax 

(3.22) 

=>  r  =  PQD2{f  -  Ax) 

(3.23) 

=>  r  =  PQD2r 

(3.24) 

=>  r  =  Df1QTPTr. 

(3.25) 

At  each  iteration,  we  check  the  unsealed  relative  residuals  using  (3.25).  Note 
that  a  permutation  of  the  inverted  scaling  matrix  is  computed  once  and  used  for 
recovering  the  unsealed  residual  at  each  iteration.  This  process  is  included  in  the 
total  time,  but  the  cost  is  minimal  compared  to  sparse  matvecs  and  triangular 
solves.  During  our  tests,  we  enforce  a  time  limit  of  10  minutes,  and  consider 
the  particular  run  as  failure  if  it  exceeds  this  limit. 


Table  3.3:  Values  of  fillin  f  and  bandwidth  k  for  the  test  problems 


Number 

Name 

fma x(fillin) 

k(a-bandwidth) 

1 

2D_54019 J3IGHK 

10,000 

2 

2 

APPU 

10,000 

50 

3 

ASIC_680k 

10,000 

2 

4 

BUNDLE  1 

10,000 

50 

5 

DC1 

10,000 

50 

6 

DW8191 

10,000 

190 

7 

FEM_3D_THERMAL  1 

10,000 

50 

8 

FINAN512 

10,000 

50 

9 

FP 

10,000 

2 

10 

H20 

10,000 

50 

11 

MSC23052 

10,000 

50 

12 

RAEFSKY4 

10,000 

50 

3.4.2  Comparative  analysis 

In  Figure  3.1,  we  compare  the  normalized  time  (with  respect  to  the  banded 
preconditioner)  of  ILUT  and  ILUTI  using  droptol  =  10-1  and  fillin  =  fmax- 
ILUT  fails  in  four  of  the  test  cases  and  diverges  in  one  case,  while  ILUT  does 
not  fail  but  diverges  for  three  of  the  test  problems.  WSO,  on  the  other  hand, 
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Table  3.4:  Number  of  nonzeros  in  matrix  L  +  U,  for  various  ILUT  precondi- 
tioners _ 


Matrix 

ILUT 

(10 1,  fmax) 

ILUTI 

(10 1,  fmax) 

ILUT 

(10-3,  fmax) 

ILUTI 

(10-3,  fmax) 

ILUT 

(0  ,k) 

ILUTI 
(0  ,k) 

1 

- 

419,835 

- 

1,503,034 

1,213,809 

1,217,347 

2 

2,377,197 

2,310,466 

- 

- 

- 

- 

3 

- 

2,473,454 

- 

2,783,146 

- 

4,457,554 

4 

347,395 

290,337 

10,247,157 

1,368,809 

1,815,975 

992,708 

5 

607,692 

707, 162 

1,230,897 

1,100,108 

- 

- 

6 

43,702 

54, 189 

476,802 

1,239,229 

1,577,605 

1,626,048 

7 

247, 643 

256,741 

762,011 

695, 664 

2,225,895 

2,217,004 

8 

472,830 

478, 568 

2,336,798 

1,427,693 

5,191,052 

5,059,937 

9 

- 

2,875,303 

2,104,578 

4,301,556 

892,662 

892,129 

10 

1,306,190 

1,249,849 

7,602,509 

7,316,812 

- 

- 

11 

- 

26,402,760 

4,515,416 

27,939,415 

2,417,059 

3,294,876 

12 

923,285 

858,737 

5,166,173 

3,911,335 

3,194,353 

3,296,441 

succeeds  in  11  of  the  12  test  cases.  For  the  other  test  problem,  namely  the 
matrix  M S'C'23052,  WSO  reaches  a  relative  residual  of  1.6  x  10-2,  while  ILUT 
and  ILUTI  either  diverge  or  fail.  WSO  is  the  fastest  method  based  on  the  total 
solution  time  for  7  out  of  the  12  cases. 

In  Figure  3.2,  we  compare  the  normalized  time  (with  respect  to  the  banded 
preconditioner)  of  ILUT  and  ILUTI  using  droptol  =  10-3  and  fillin  =  fmax- 
In  this  case,  ILUT  fails  or  diverges  for  four  of  the  test  cases,  and  it  exceeds 
the  time  limit  for  one  test  case.  ILUTI,  on  the  other  hand,  diverges  for  three 
cases  and  stagnates  for  one.  WSO  is  the  the  fastest  method  based  on  the  total 
solution  time  for  9  out  of  the  12  cases. 

In  Figure  3.3,  we  compare  the  normalized  time  (with  respect  to  the  banded 
preconditioner)  of  ILUT  and  ILUTI  using  droptol  =  0  and  f  illin  =  k.  Here, 
k  =  kl  =  ku  is  the  upper  (or  lower)  bandwidth  of  the  banded  preconditioner. 
ILUT  diverges  in  three  of  the  test  cases,  exceeds  the  time  limit  for  three  others, 
and  fails  for  one.  ILUTI,  on  the  other  hand,  diverges  for  two  of  the  test  cases 
and  exceeds  the  time  limit  for  three  others.  WSO  is  the  fastest  method  based 
on  the  total  solve  time. 

In  Table  3.5,  we  present  details  of  the  experimental  results.  Notice  that 
there  is  no  failure  for  WSO,  whereas,  ILUTI  and  ILUT  have  failures.  If  the 
ILUT  factorization  returns  with  an  error  code,  we  indicated  this  by  an  F  in  the 
table.  If  BiCGStab  stagnates  (indicated  by  Stag ,  or  *),  then  we  consider  that 
the  method  did  not  converge  to  the  desired  relative  residual  of  10 _5  after  500 
iterations. 

In  Figures  3.4-3.15,  we  present  the  residual  histories  for  12  systems  using 
ILUT,  ILUTI,  and  WSO.  The  numbers  of  iterations  for  these  methods  are  com¬ 
parable.  However,  WSO  is  faster  due  to  higher  FLOP  counts  and  better  con¬ 
currency. 
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WSO  vs  ILUT(droptol  =0.1,  fillin  =  n) 


Figure  3.1:  Comparison  of  performance  of  WSO  banded  preconditioner  with 
ILUT(fmax,  10- V 


Table  3.5:  Comparison  of  ILUT,ILUTI,  and  WSO:  Number  of  Iterations  (Total 
Solve  Time) 


Matrix 

ILUT 

(io -\/) 

ILUTI 

(io-\/) 

ILUT 

(io -3,/) 

ILUTI 

(io -3,/) 

ILUT 

(0  ,k) 

ILUTI 

(0,  k) 

WSO 

1 

F 

20(0.53) 

F 

6(0.58) 

19(2.87) 

1(2.13) 

1(0.33) 

2 

12(2.22) 

11(2.20) 

>  10  min 

>  10mm 

>  10mm 

>  10mm 

23(1.8) 

3 

F 

2(29.92) 

F 

2(29.94) 

F 

2(219.2) 

7(10.32) 

4 

6(0.76) 

7(0.31) 

5(35.12) 

3(2.07) 

6(70.51) 

2(2.64) 

19(1.18) 

5 

18(9.5) 

40(6.82) 

8(12.94) 

14(8.17) 

>  lOrmn 

>  10mm 

25(5.54) 

6 

Div. 

Div. 

500(3.71)* 

Stag. 

Div. 

1(1.09) 

16(0.52) 

7 

9(0.09) 

11(0.16) 

3(0.19) 

3(0.20) 

2(13.19) 

2(4.94) 

37(1.09) 

8 

5(0.12) 

5(0.2) 
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WSO  vs  ILUT(droptol  =0.001,  fillin  =  n  ) 


Figure  3.2:  Comparison  of  performance  of  WSO  banded  preconditioner  with 
ILUT(fmax,  10~3) 
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WSO  vs  ILUT(droptol  =0.0,  fillin  =  bw) 


Figure  3.3:  Comparison  of  performance  of  WSO  banded  preconditioner  with 

nuT(k,  o.o; 


42 


Figure  3.4:  Residual  history  of  WSO  banded  preconditioner  for  problem 
2D_54  01 9_HIGHK 


Figure  3.5:  Residual  history  of  WSO  banded  preconditioner  for  problem  Appu 
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Figure  3.6:  Residual  history  of  WSO  banded  preconditioner  for  problem 
ASIC-680k 


Figure  3.7:  Residual  history  of  WSO  banded  preconditioner  for  problem  BUN- 
DLE1 
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Figure  3.8:  Residual  history  of  WSO  handed  preconditioner  for  problem  DC1 


Figure  3.9:  Residual  history  of  WSO  handed  preconditioner  for  problem  DW8192 
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Figure  3.10:  Residual  history  of  WSO  banded  preconditioner  for  problem 
FEM_3D_  THERMAL 


Figure  3.11:  Residual  history  of  WSO  banded  preconditioner  for  problem  FT 
NAN512 
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Figure  3.12:  Residual  history  of  WSO  banded  preconditioner  for  problem  FP 
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Figure  3.13:  Residual  history  of  WSO  banded  preconditioner  for  problem  H20 
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Figure  3.14:  Residual  history  of  WSO  banded  preconditioner  for  problem 
MSC23052 


Figure  3.15:  Residual  history  of  WSO  banded  preconditioner  for  problem  RAEF- 
SKYf 
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Chapter  4 


A  Tearing-based  Hybrid 
Parallel  Banded  Linear 
System  Solver 

4.1  Introduction 

Numerical  handling  of  partial  differential  equations  (PDEs)  plays  a  crucial  role 
in  modeling  of  physical  processes.  It  involves  discretization  of  these  PDEs  using, 
for  example,  Finite  Difference  or  Finite  Element  methods  and  results  in  nonlin¬ 
ear  systems  of  equations  whose  solution  yields  at  each  iteration  a  large  sparse 
linear  system.  These  systems  can  often  be  reordered  using  Reverse  Cuthill- 
McKee  [20,  53]  or  Spectral  [29,  30,  72]  reorderings  into  banded  or  low-rank 
perturbations  of  banded  linear  systems  and  solved  using  direct  or  precondi¬ 
tioned  iterative  methods  in  which  the  preconditioner  is  a  banded  matrix.  In 
this  Chapter  we  propose  a  novel  hybrid  parallel  algorithm  for  solving  banded 
linear  systems  and  state  conditions  that  guarantee  its  convergence. 

The  parallel  solution  of  banded  linear  systems  has  been  considered  by  many 
authors  [19,  23,  24,  34,  37,  47,  55,  59,  60,  67,  81].  The  overarching  strategy 
consists  of  two  main  stages:  (i)  the  coefficient  matrix  is  reordered  or  modified 
so  that  it  consists  of  several  independent  blocks,  but  which  are  interconnected 
by  a  single  block.  Certain  algorithms  produce  this  single  block  as  a  Schur  com¬ 
plement,  others  produce  different  reduced  systems;  (ii)  once  this  reduced  system 
corresponding  to  this  single  block  is  solved,  the  original  problem  decomposes  into 
several  independent  smaller  problems  facilitating  almost  perfect  parallelism  in 
retrieving  the  rest  of  the  solution  vector. 

Our  approach  is  different  from  the  algorithms  cited  above.  Its  main  idea 
was  first  proposed  by  Sameh  et  al.,  [54]  in  which  the  study  was  restricted  to 
diagonally  dominant  symmetric  positive  definite  linear  systems.  In  this  chapter 
we  generalize  it  to  non-symmetric  linear  systems  without  the  requirement  of 
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diagonal  dominance.  Furthermore,  we  show  how  to  precondition  the  implicit 
balance  system. 

The  rest  of  this  chapter  is  organized  as  follows:  First,  we  introduce  the  al¬ 
gorithm  by  showing  how  it  “tears”  the  original  system.  Second,  we  analyze 
the  conditions  that  guarantee  the  non-singularity  of  the  balance  system  for  any 
nonsingular  original  system.  Further,  we  show  that  if  the  original  system  is 
Symmetric  Positive  Definite  (SPD)  then  the  much  smaller  balance  system  is 
SPD  as  well.  Third,  we  discuss  preconditioned  iterative  methods  for  solving  the 
balance  system,  the  Conjugate  Gradient  (CG)  for  the  SPD  case,  and  the  Stabi¬ 
lized  Bi-Conjugate  Gradient  (BiCGStab)  for  the  non-symmetric  case.  We  call 
our  algorithm  Domain  Decomposition  CG  (DDCG)  or  Domain  Decomposition 
BiCGStab  (DDBiCGStab)  for  symmetric  and  non-symmetric  linear  systems, 
respectively.  Finally,  we  present  numerical  experiments  and  pseudocode. 


4.2  Partitioning 

We  are  interested  in  solving  linear  systems 


.Ax  —  f 


(4.1) 


where  A  €  Knxn  and  x,  f  £  Rn.  Let  A  =  [ay]  have  a  lower  and  upper  band- 
widths  of  fed,  in  other  words  \i  —  j\  <  kd  <C  n.  We  can  rewrite  our  banded  linear 
system  (4.1)  using  a  block  tridiagonal  matrix  A,  blocked  vectors  x  and  f. 

For  clarity  of  presentation,  we  illustrate  the  partitioning  and  tearing  scheme 
using  three  partitions  ( p  =  3).  For  the  same  reason  some  of  our  theorems 
are  proven  only  for  three  partitions.  Also,  we  assume  that  all  the  partitions 
are  of  equal  size,  m,  and  that  all  the  overlaps  are  of  identical  size  r  =  kd. 
Generalization  to  the  case  of  p  >  3  partitions  of  different  sizes  is  straightforward. 

The  banded  matrix  A  and  vectors  x  and  f  can  be  written  as 


A  = 


/  xi  \ 

x2 

*3 
x4 

X.5 

X6 

V  x7  y 


and 


\ 


/ 


(4.2) 
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Here,  Atj ,  x,;  and  f,  for  i,j  =  1, . . . ,  7  are  blocks  of  appropriate  sizes.  Let  us 
define  the  partitions  of  A,  delineated  by  lines  in  the  illustration  of  A  in  (4.2),  as 


Ak 


(  qO)  jO) 

^11 

Ak)  Ak) 

d-21  ^22 

A(k) 


A 

A 


for  At  =  1,2,3 


(4.3) 


If  ?7  =  2 (k  —  1),  then  the  blocks  A$  =  Av+m+v,  for  /x,  v  =  1,2,3,  except  for 
the  top  left  /x  =  v  ^  1  block  of  the  last  and  middle  partition,  and  bottom  right 
=  v  ^  3  block  of  the  first  and  middle  partition.  For  these  blocks  the  following 
equality  holds  Hgg”11  +  A^  =  Av+  i,v+i-  The  exact  choice  of  splitting  Hgg-1^ 
and  will  be  discussed  below. 

Thus,  we  can  rewrite  (4.1)  as  3  independent  linear  systems,  k  =  1,2,3, 


/ 

^11 

j(*) 

^12 

\ 

(  xf> 

Aw 

^21 

^22 

4fc3} 

(fc) 

x2 

V 

^32 

4fc3}  J 

(fc) 

V  x3 

(1  -  ak- 1)  frj+l  -  y^-i 

f»)+2 

ak^ri+3  +  y  k 


(4.4) 


where  yk  and  ak  are  yet  to  be  specified.  Now,  we  need  to  choose  the  scaling 
parameters  0  <  ai,a2  <  1  and  the  adjustment  vector  yT  =  (y^y^)  so  that 
the  solution  of  (4.4)  coincides  with  the  respective  parts  of  the  solution  of  (4.1). 
We  will  achieve  this  if 


d1) 


A  2) 


and  Xg2"*  =  x)‘ 


,0) 


(4.5) 


Note  that  ao  =  0,  a3  =  1,  y0  =  y3  =  0,  and  without  loss  of  generality,  we 
choose  «i  =  «2  =  0.5.  Let 


^fc1 


/  B ^ 
^>21 
V  4i] 


(k)  r>(k) 


B\2 

d(^) 

^22 

B^k) 

n32 


B\3 

f?(k) 

£>23 

B^k) 

n33 


we  can  rewrite  (4.4)  as 


/  xifc) 


/  p(^)  p(^) 

/  -Ou  -d12 

(k) 


\  B  {£ 


B^k) 
n22 
r?(k) 
^32 


obtaining 

[xf> 

1  v(fe) 


=  H^((  1  -  at_!)f,+1  -  yfc_r)  +  B™ f,+2  +  B^(afcf,+3 
=  fl<*>((  1  -  afc_i)f„+i  -  yfc_i)  +  B32  ^rj+2  +  B^(akfv+3 


(1  -  Ofc-l)fr)+l  -  yfc_i 
frj+2 

Ctfcfrj+S  +  yfc 


?(fe)  < 


Then,  using  (4.5)  and  (4.8)  we  obtain 


(4.6) 


(4.7) 


■yfc) 

yfc) 


(4.8) 


B 


(0 

33 


/j 


(C+1) 


yc  =  gc  +  B^y  c_i  +  H^+1)yc+i 


(4.9) 
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for  £  =  1,2,  where 


/ 

gc  =  ((ac_r  -  acB$ ,B[i+1\ac+1B[i+1)) 

V 

(4.10) 

Finally,  letting  gT  =  (gf,  g^),  the  adjustment  vector  y  can  be  found  by  solving 
the  balance  system 

My  =  g  (4.11) 


fr;— 1 

K 

f?7+l 

f?7+2 

f?7+3 


where 


M  = 


l1)  _i_  b ^  \ 

33  ^  -°11  'd13  I 

d(2)  p(2)  .  p(^)  I 

^33  +  -#11  / 


(4.12) 


Once  y  is  found,  we  can  solve  the  three  independent  linear  systems  (4.4)  in 
parallel.  Next,  we  focus  our  attention  on  solving  (4.11).  First,  we  note  that  the 
matrix  M  is  not  available  explicitly.  Thus  using  a  direct  method  to  solve  the 
balance  system  (4.12)  is  not  possible  and  hence  we  need  to  resort  to  iterative 
schemes.  Our  iterative  methods  of  choice  for  M  are  CG  and  BiCGStab,  respec¬ 
tively.  However,  to  use  them,  we  must  be  able  to  compute  the  initial  residual 
ro  =  g  —  M y0  and  compute  matrix  vector  products  of  the  form  q  =  Mp.  Also, 
we  need  to  determine  what  conditions  A  and  the  partitions  A/:  must  satisfy  for 
M  to  be  nonsingular.  This  will  be  addressed  in  the  next  section. 

As  a  final  note,  we  mention  that  in  general  (for  arbitrary  p)  the  balance 
system  is  block  tridiagonal  of  the  form 


(  Bi $  +  b[? 

~ b ^ 

\ 

-b£} 

r(2)  r(3) 

,33  ^  -°11 

-B&-3)  Bt2)  +  B[r1] 

' 

W? 

1 

—B. 


(p-i) 

31 


B. 


(p-i) 

33 


+  B$>  ) 


(4.13) 


4.3  The  Balance  System 

4.3.1  The  symmetric  positive  definite  case 

In  this  section,  we  assume  that  A  is  SPD,  and  investigate  the  conditions  under 
which  the  balance  system  is  also  SPD. 

Theorem  1  If  partitions  Ak  in  (4.3)  are  SPD  for  k  =  1, . . .  ,p  then  the  coeffi¬ 
cient  matrix  M ,  of  the  balance  system  in  (4.11),  is  SPD. 
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Let  p  =  3.  Notice  that  if  Ak  is  SPD  then  Ak  1  is  also  SPD.  Let 


qt  = 


I 

o 


(4.14) 


Pre-multiplying  (4.6)  by  PT  and  post-multiplying  by  P,  we  obtain 

(  d(^)  _  p(^)  \ 

°TV0=(  <  ]'  (415) 

which  is  also  SPD.  However,  since  the  matrix  M  can  be  written  as  the  sum 


M 


Mi  +  M2  +  M3 


0 


0  0 


B 


11 


(2) 

11 

(2) 

31 


-B 


(2) 


B 


13 

(2) 

33 


0  0 
0  B[f 


(4.16) 

(4.17) 


zT Mz  >  0  for  any  nonzero  z. 

Let  A  be  SPD  and  DD,  then  we  can  find  a  splitting  that  results  in  SPD/DD 
Ak,  which  guarantees  that  M  is  SPD.  This  result  is  contained  in  the  following 
theorem. 


Theorem  2  If  A  in  (4.1)  is  SPD  and  DD  then  the  partitions  Ak  in  (4.3)  can 
be  chosen  such  that  they  inherit  the  same  properties.  Further,  the  coefficient 
matrix  M ,  of  the  resulting  balance  system  in  (4.11),  is  SPD. 

Since  A  is  DD,  we  only  need  to  obtain  a  splitting  that  ensures  the  diagonal 
dominance  of  the  overlapping  parts.  Let  e  =  [1, . . . ,  1]T,  e,  be  the  i-th  column 
of  the  identity,  |.|  denote  the  absolute  value,  diag(.)  and  offdiag(.)  denote  the 
diagonal  and  off  diagonal  elements,  respectively.  Now  let  the  elements  of  the 
diagonal  matrices  and  =  [/i^+1,2^],  of  appropriate  sizes,  be 

given  by 

hf1]  =  eJ\A^\e  +  |offdiag(H$  +  ^+1))|e  (4.18) 

hu+h2)  =  ef|4f2+1)|e+  ^efloffdia^  +4i+1))|e  (4.19) 

(4.20) 

respectively.  Notice  that  h  and  '1]  are  sums  of  absolute  values  of  all 
off  diagonal  elements,  with  elements  in  the  overlap  being  halved,  in  the  i-th  row 
to  the  left  and  right  of  the  diagonal,  respectively.  Moreover,  let  the  difference 
between  the  positive  diagonal  elements  and  the  sums  of  absolute  values  of  all 
off  diagonal  elements  in  the  same  row  be  given  by: 

Dc  =  diag(4|)  +  4C+1} )  -  H&  -  44 .  (4.21) 
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Now,  if 


As  =  H(1]  +  \Di  +  ^offdiag^^  +  ^u+1)).  and 

4i+1)  =  ^+)1  +  ^c  +  ^offdiag(4|)  +  4fi+1)), 

(4.22) 

it  is  easy  to  verify  that  4?  +  4i+^  =  ^2f+i,2C+i  and  each  -4 k,  for  k  =  1, . . .  ,p, 
is  SPD/DD.  Consequently,  if  (4.2)  is  SPD/DD  so  are  the  partitions  and  by 
Theorem  1,  the  balance  system  is  guaranteed  to  be  SPD. 

4.3.2  The  non-symmetric  case 

Next,  if  A  is  just  a  nonsingular  non-symmetric  matrix,  we  explore  conditions 
under  which  (4.12)  becomes  nonsingular. 

Theorem  3  Let  the  matrix  A  in  (4.1)  be  any  nonsingular  matrix  with  partitions 
Ak,  k  =  1, . . .  ,p,  in  (4.3)  that  are  also  nonsingular.  Then  the  coefficient  matrix 
M,  of  the  balance  system  in  (4.11),  is  nonsingular. 

Let  p  =  3.  Notice  that  we  can  write 

l  A\  \  /  0m-T  \ 

A  =  0  m-2r  +  A2  (4.23) 

\  A3  J  \  0 m-T  ) 

Next,  let  the  nonsingular  matrix  C  be  given  by, 


(4.24) 
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where  Im  and  Om  are  the  identity  and  zero  matrices  of  order  m,  respectively. 
Using  (4.23),  (4.24),  (4.6),  we  obtain 


C 


d(2) 

o(2) 

B12 

d(2) 

B13 

0 

Om— 2  r 

0 

d(2) 

B31 

d(2) 

B32 

d(2) 

B33 

where  the  0  matrices  without  subscripts  are  considered  to  be  of  appropriate 
sizes.  Using  the  permutation  matrix 
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we  can  write 


P1  CP  = 


’  It 

p(l) 

^13 

\ 

Im— 2t 

p(l) 

-^23 

Im—2r 

I’m — 2t 

o(3) 

“-^21 

Ir 

p(3) 

-^31 

&12 

r >(.1)  _j_  r>^) 
^33  +  *>11 

-^13 

n32 

d(2) 

“-O31 

p(2)  .  p(3) 
^33  +  -Oil 

/ 

V 

(4.26) 

or  rewriting  (4.26)  as  a  block  2x2  matrix,  with  blocks  delineated  by  lines  above, 
we  obtain 

pTCp  =  I  ^rn- 4r  |  |  (4.27) 


-*3m— 4r 

^1  ' 

,  ^2 

M  / 

Consider  the  eigenvalue  problem 


J-Sm— 4r 

Zi  ' 

M  / 

Ul 

U2 


=  A 


Ul 

U2 


(4.28) 


Pre-multiplying  the  first  block  row  of  (4.28)  by  Zj  and  noticing  that  Zj Z\  —  0 
we  obtain 

(l-A)ZjUl=0  (4.29) 

Hence,  either  A  =  1  or  Z^u\  =  0.  If  A  ^  1,  then  the  second  block  row  of  (4.28) 
yields 

Mu-2  =  Au2  (4.30) 

Thus,  the  eigenvalues  A  of  C  are  either  1  or  are  identical  to  those  of  the  balance 
system,  and  we  can  write: 


A  (C)  =  A  (PTCP)  €  {1,  A  (AT)}.  (4.31) 


Hence,  since  C  is  nonsingular,  the  balance  system  is  also  nonsingular. 

Notice  that  (4.24)  and  (4.31)  suggest  a  powerful  preconditioning  technique 
for  the  banded  linear  system  (4.1). 

Next,  we  explore  conditions  that  the  nonsingular  matrix  A  must  satisfy,  so 
that  we  are  guaranteed  that  there  exists  a  splitting  that  results  in  nonsingular 
partitions  A^.  We  also  provide  a  formula  for  computing  such  a  splitting. 
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Theorem  4  If  matrix  A  in  (4.1)  is  nonsingular,  its  symmetric  part 
H  =  ±(A  +  AT)  is  symmetric  positive  semidefinite  (SPSD)  and 

A f(H)  fl  A f(B)  =  0  for  B  =  B\  =  B%  where 


(  A(2) 


B i  = 


a: 


11 

(2) 


V 


a(3) 

aii 

a(3) 

A21 


A  (P) 
A11 
/((A 
A21 


l  4f 


,4T  = 


A 


(2)7 

12 


V 


a(3) 

A(3)7 

^12 


A(p) 

A(p)7 

^12 


) 


(4.32) 

i/ien  t/iere  exists  a  splitting  such  that  the  partitions  Ak  in  (4.3)  for  k  =  1, . . .  ,p 
are  nonsingular. 


Let  p  =  3.  Let  A  be  the  block  diagonal  matrix  in  which  the  blocks  are  the 
partitions  Ak, 


( 4? 
,(1) 
^21 


24(1) 

^12 

A(1) 

^22 

A(1) 

^23 

^32 

A(1) 

^33 

a(2) 

a(2) 

^12 

a(2) 

^21 

a(2) 

^22 

A (2) 
^23 

a(2) 

^32 

A2) 

^33 

4(3) 

4(3) 

^12 

4(3) 

-^21 

A(3) 

^22 

a(3) 

^23 

a(3) 

^32 

a(3) 

^33 

A  = 


V 

and  the  nonsingular  permutation  matrix  JT  be  defined  as 
(  It 

Im—2r 

It  It 


J1  = 


Im.  —  I 


Im.-1 


0  IT 


0  It 


\ 


(4.33) 


(4.34) 
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Then,  since  rank(A)  =  rank(JTAJ),  where 


where  B\  and  B2  contain  B\  and  B2,  respectively,  with  additional  zeros  at  the 
top  and  bottom.  Notice  that  using  splitting  =  ±(Av+iv+i  +  A^+1  +1)  +  01 
and  A33-1)  =  ^(Av+ iv+i  —  A^+1  +1)  —  (31  we  can  choose  /3  so  as  to  ensure  that 
B\  and  B2  are  of  full  rank  and  K  is  SPD.  Thus,  using  Theorem  3.4  on  p.  17 
of  [8],  we  conclude  that  A  has  full  rank  and  consequently  the  partitions  Ak  are 
nonsingular. 

Notice  that  the  condition  B  =  B\  =  B 2  in  the  theorem  above  is  not  as 
restrictive  as  it  looks.  Recalling  that  the  original  matrix  is  banded,  it  is  easy 
to  see  that  matrices  A\ f  and  A2f  are  almost  completely  zero  except  for  small 
parts  in  the  respective  corners  of  size  no  larger  than  the  overlap.  This  condition 
can  be  viewed  as  a  requirement  of  symmetry  surrounding  the  overlaps. 

Let  us  now  focus  on  two  special  cases.  First,  if  the  matrix  A  is  SPD  the 
conditions  of  the  above  theorem  are  immediately  satisfied  and  we  obtain  the 
following  Corollary. 

Corollary  1  If  the  matrix  A  in  (4.1)  is  SPD  then  there  exists  a  splitting  (as 
described  in  Theorem  f)  such  that  the  partitions  Ak  in  (4.3)  for  k  =  1, . . .  ,p  are 
nonsingular  and  consequently  the  coefficient  matrix  M ,  of  the  balance  system  in 
(4.11),  is  nonsingular. 

Second,  if  the  symmetry  requirement  is  dropped  in  Theorem  2,  then  combining 
the  results  of  Theorems  2  and  3  we  obtain  the  following. 

Corollary  2  If  A  in  (4.1)  is  DD  (hence  nonsingular)  then  the  partitions  Ak 
in  (4.3)  can  be  chosen  such  that  they  are  also  DD  (hence  nonsingular)  for  k  = 
1  and  consequently  the  coefficient  matrix  M,  of  the  balance  system  in 

(4.11),  is  nonsingular. 
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4.4  The  Hybrid  Solver  of  the  Balance  System 


Let  us  show  how  we  can  compute  the  initial  residual  ro  needed  to 
CG  or  BiCGStab  for  solving  the  balance  system.  Notice  that  we 
(4.7)  as 
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and  the  residual  can  be  written  as 
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Let  the  initial  guess  be  given  by  y0  =  0,  then  we  have 
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Thus,  to  compute  the  initial  residual  we  must  solve  the  p  independent  linear 


systems  and  subtract  the  bottom  part  of  the  solution  vector  of  partition  £, 
h|p,  from  the  top  part  of  the  solution  vector  of  partition  £  +  1,  h^+1\  for 

Finally,  to  compute  matrix- vector  products,  q  =  M p,  using  (4.39)  and  (4.40) 


we  obtain 


My  =  g-  r  =  r0-r  = 
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Hence,  we  can  compute  matrix- vector  products  M p,  for  any  vector  p,  in  a 
similar  fashion  as  the  initial  residual  using  (4.41)  and  (4.38). 

The  modified  iterative  methods  (CG  and  BiCGStab)  used  to  solve  (4.11)  are 
the  standard  iterative  methods  with  initial  residual  and  matrix-vector  products 
computed  using  (4.40)  and  (4.41),  respectively. 
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4.5  Preconditioning  the  balance  system 


We  precondition  the  balance  system  using  a  block  diagonal  matrix  of  the  form, 


(  BiV 
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(4.42) 


where  B ^  ss  B ^  =  Agg'1  and  ~  ^n+1'*  =  ^n+1'*  •  Here,  we  are 

taking  advantage  of  the  fact  that  the  elements  of  the  inverse  of  a  banded  matrix 
decay  as  we  move  away  from  the  diagonal,  e.g.  [22].  Also  such  decay  becomes 
more  pronounced  as  the  banded  matrix  is  more  diagonally  dominant.  Using 
Sherman-Morrison- Woodbury  formula,  [36] ,  we  can  write 
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where  we  prefer  the  last  equality  as  it  avoids  extra  interprocessor  communi¬ 
cation,  assuming  that  the  original  overlapping  block  is  stored 

separately  on  processor  ([.  Consequently,  to  precondition  (4.11)  we  only  need 
to  multiply  vectors  with  ,  and  solve  small  linear  systems  with  coefficient 
matrices  A^  +  A^+1\ 


4.6  Numerical  Experiments 

We  compare  the  performance  of  our  hybrid  solver  with  both  preconditioned 
iterative  solvers  as  well  as  direct  solvers.  Consequently,  in  this  section,  we 
compare  DDCG  and  DDBiCGStab  with  preconditioned  CG  and  BiCGStab,  as 
well  as  LAPACK  [1],  and  ScaLapack  [12].  These  solvers  have  been  tested  on  a 
variety  of  linear  systems.  Here  we  present  only  few  examples  whose  results  are 
typical  of  a  much  larger  collection.  The  six  test  problems  are  banded  systems 
extracted  from  two  large  sparse  matrices,  selected  from  The  University  of  Florida 
Sparse  Matrix  Collection  [21]  (see  Table  4.1).  First,  these  two  matrices  Ei 
for  i  =  1,2  are  reordered  using  the  Reverse  Cuthill-McKee  scheme  and  three 
central  bands  of  bandwidths  129,  257,  and  513  are  extracted  from  them.  The 
diagonal  elements  of  these  banded  systems  are  then  perturbed  so  that  the  three 
symmetric  matrices  Sj  are  SPD/DD  with  a  degree  of  DD  ^  1.008,  and  the  other 
three  non-symmetric  matrix  Nj  are  made  nonsingular.  Finally,  all  six  matrices 
are  equilibrated  to  produce  condition  numbers  of  ^  7.6E  +  3  for  the  matrices 
Sj,  and  ^  4. IE  +  6  for  the  matrices  Nj. 

Notice  that  diagonal  dominant  is  not  a  requirement  for  the  convergence  of 
DDCG;  we  only  need  the  partitions  to  be  SPD  (see  Theorem  1),  nevertheless  we 
ensure  diagonal  dominant  to  give  an  extra  advantage  to  the  block- Jacobi  (BJ) 
preconditioned  CG.  Also,  even  though  we  did  not  prove  the  convergence  of  our 
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Figure  4.1:  Distribution  of  a  banded  linear  system  across  various  processors. 


hybrid  solver  assuming  only  non-singularity  of  the  linear  system,  see  Theorems 
3  and  4,  our  hybrid  solver  proved  to  be  successful  in  handling  such  a  case. 


Table  4.1:  Description  of  test  matrices 


Matrix  Size  Nonzeros  Symm.  Application 

E\:  AMD/G3_circuit  1,585,478  7,660,826  yes  Circuit  Simulation 
£2:  Sandia/ASIC_680k  682,862  4,001,317  no  Circuit  Simulation 


Our  experiments  are  performed  on  an  SGI  Altix  (with  Intel  Itanium  2  pro¬ 
cessors)  at  the  National  Center  for  Supercomputing  Applications  (NCSA)  of 
the  University  of  Illinois  at  Urbana-Champaign.  For  every  run,  we  state  the 
time  <seconds>,  <ff  of  iterations>  of  the  iterative  scheme,  and  the  2-norm  of 
the  residual  <  | j r 1 1 2  =  ||f  —  Ax||2  >.  In  all  experiments,  the  exact  solution  is 
x*  =  e,  with  the  right-hand-side  chosen  as  the  multiplication  of  either  Sj  or  Nj 
with  e.  The  six  experiments  are  performed  on  1,  2,  4,  8,  16  and  32  processors. 
The  stopping  criteria  (s.c.)  for  the  classical  CG  and  BiCGstab  are  chosen  as 
the  relative  residual  |  |r| I2/I |ro 1 1 2  <  10-4, 10-6, 10~10,  while  for  our  hybrid  solver 
the  stopping  criterion  is  chosen  as  1 1 r 1 1 2 / 1 1 i*o 1 1 2  <  10-4.  We  also  terminate  all 
solvers  if  the  number  of  iterations  exceeds  the  size  of  the  system  to  be  solved. 

For  parallel  implementation,  the  banded  matrices  and  the  corresponding 
right-hand-sides  are  distributed  by  blocks  of  rows  across  the  processors  (see 
Fig.  4.1).  BJ  preconditioning  is  chosen  for  CG  and  BiCGstab  as  it  allows  per¬ 
fect  parallelism  in  the  preconditioning  stage.  It  is  worth  mentioning  that  BJ 
preconditioning  of  CG  and  BiCGstab  requires  solving  independent  linear  sys¬ 
tems  each  of  order  equals  to  the  number  of  rows  on  the  current  processor,  while 
preconditioning  our  hybrid  solver  requires  solving  independent  linear  systems 
of  the  much  smaller  order  which  is  equal  to  the  bandwidth. 

It  can  be  seen  from  the  experiments  that  preconditioning  the  balance  sys¬ 
tem  in  DDCG  and  DDBiCGStab  schemes  significantly  reduces  the  number  of 
iterations.  As  mentioned  earlier,  this  a  great  benefit  for  very  little  cost.  Figures 
4.2  and  4.3  show  the  2-norms  of  the  final  residuals  achieved  by  all  four  solvers, 
the  lowest  2-norm  of  the  residuals  was  achieved  by  our  hybrid  solver  and  ScaLa- 
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pack  even  though  our  stopping  criterion  for  solving  the  balance  system  is  only 
I  |r|  I2/I  |ro|  I2  <  10-4.  Requiring  such  low  residuals  from  the  classical  CG  and 
BiCGstab  would  have  consumed  much  more  time  than  illustrated  in  4.4  and 
4.5.  These  two  figures  illustrate  the  excellent  performance  of  our  parallel  hybrid 
solver  which  is  enhanced  as  the  bandwidth  increases. 


Figure  4.2:  Achieved  residual  for  different  methods  and  stopping  criteria  on 
G3-drcuit 


Figure  4.3:  Achieved  residual  for  different  methods  and  stopping  criteria  on 
ASIC-680k 
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Figure  4.4:  Performance  on  8  processors  with  different  stopping  criteria  on 
G3-drcuit 


Figure  4.5:  Performance  on  8  processors  for  different  stopping  criteria  on 
ASIC-680k 


The  overall  behavior  of  the  different  algorithms  is  shown  in  Fig.  4.6  for 
banded  systems  extracted  from  the  symmetric  matrix  G3_circuit,  see  [6].  Here, 
we  show  the  speed  improvement,  or  deterioration,  compared  to  ScaLapack  as  the 
number  of  processors  increase  from  2  to  32.  We  note  that:  (i)  ScaLapack  out¬ 
performs  LAPACK  when  the  number  of  processors  is  4  or  higher,  (ii)  depending 
on  the  effectiveness  of  the  preconditioner,  preconditioned  CG  and  BiCGstab  can 
outperform  ScaLapack,  and  perform  similarly  to  our  un-preconditioned  hybrid 
solver,  and  (iii)  our  preconditioned  hybrid  solver  outperforms  all  the  rest  as  well 
as  obtain  a  much  smaller  final  residual  norm.  These  remarks  are  also  valid  for 
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our  experiments  with  the  non-symmetric  matrices  extracted  from  ASIC_680k 
matrix,  see  [6]. 

Finally,  scalability  of  the  algorithms  for  the  symmetric  and  non-symmetric 
systems  is  shown  in  Fig.  4.7  &  4.8,  respectively. 


Scalapack  (64) 

— -  Scalapack  (128) 

♦  Scalapack  (256) 
Prec.  CG  (64) 

— f-  Prec.  CG  (128) 

— * —  Prec.  CG  (256) 
DDCG  (64) 

— •-  -  DDCG  (128) 

— •—  DDCG  (256) 

Prec.  DDCG  (64) 
— ■  -  Prec.  DDCG  (128) 
— >—  Prec.  DDCG  (256) 


Figure  4.6:  Performance  of  CG  (stopping  criteria<10E-10),  DDCG  and  ScaLa- 
pack  on  G3_circuit 


Prcc.  DDCG  (64) 
Prcc.  DDCG  (128) 
Prec.  DDCG  (256) 


Figure  4.7:  Scalability  of  DDCG  on  G3-drcuit  for  32  processors. 
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— •—  Prec.  DDB'CGStab  (64) 

-  *  -  Prec.  DDB'CGStab  (128; 

—  Prec.  DDB'CGStab  (256; 


Figure  4.8:  Scalability  of  DDBiCGStab  on  ASIC-680k  for  32  processors. 
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4.7  Pseudocode 


1.  Partition  the  linear  system  according  to  (4.3)  and  (4.22)  on  p  processors. 

2.  Compute  ro  =  g  using  (4.40)  by  solving  p  linear  systems  in  the  first  term  of  (4.38). 

3a.  Start  Modified  Krylov  method  (initial  guess  y0  =  0  and  resid.  ro  have  been  computed) 


Modified  Prec.  CG  (see  p.  15  in  [5]) 

Modified  Prec.  BiCGStab  (see  p.  27  in  [5]) 

for  i=l,2,... 

for  i=l,2,... 

solve  M Zj_i  =  rj_i 

Pi-i  =  (if  Pi-i  =  0  failed) 

Pi- 1  =  rJ-lzi-l 

if  i=l  then 

if  *  =  1  then 

P,  =  r»-i 

y 

ll 

N 

o 

else 

else 

Pi-1  =  {pi-l/pi-2){oii-l/u!i-l) 

Pi-1  =  Pi-l/Pi-2 

Pi  =  ri-l  +  Pi—1  (Pi_r  -  Wi- lVj_i) 

end 

end 

q.  =  modified  jnultiply(M,  p4) 

solve  M  p  =  p( 

s 

II 

T' 

■o' 

^2 

Vi  =  modified _multiply(.M,  p) 

y  i  =  yj_i  +  OiPj 

«i  =  pi-i/r1  Vj,  s  =  rj_!  —  apfi 

r i  =  r,_i  -  aiqi 

if  ( 1 1  s  1 1 2  <  e)  then  x*  =  Xj_i  +  a,  p  stop 

check  convergence 

solve  Ms  =  s 

end 

t  =  modified  jnultiply(M,  s) 

U)i  =  tTs/tTt,  Tj  =  S  —  Wjt 
x,  =  Xj_i  +  aiPi  +  u>iS 
check  convergence  (if  u>i  ^  0  proceed) 
end 

3b.  Notice  that  since  ri5  yi  are  composed  of  p  —  1  blocked  components,  the  same  applies 
to  p,  =  [p^  , . . . ,  p?(;l1]T.  Suppose  that  each  of  these  block  components  is  located 
on  a  separate  processor  (one  processor  remains  idle)  then  modified  .multiply  steps  are: 
exchange  (processors  k  =  2, . . .  ,p  —  1  send  pj^  to  k  —  1  and  k  =  1, ...  ,p  —  2  send 
Pfc  to  k  +  1)  to  construct  right-hand-side  in  the  second  term  of  (4.38);  solution  of  the 
system  in  the  second  term  of  (4.38);  computation  of  the  result  using  (4.41),  preceded 
by  communication  where  processors  k  =  2, . . .  ,p  —  1  send  y)  '  to  k  —  1. 

4.  Solve  (4.1)  by  solving  p  independent  linear  systems  (4.4),  where  y  is  known. 
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Chapter  5 


Scalability  of  Parallel 
Programs 

5.1  Introduction 

Faced  with  the  challenge  of  reducing  the  power  consumption  of  conventional 
microprocessors  while  delivering  increased  performance,  microprocessor  ven¬ 
dors  have  leveraged  increasing  transistor  counts  to  deliver  multicore  processors. 
Packaging  multiple,  possibly  simpler  cores,  operating  at  lower  voltage  results 
in  lower  power  consumption,  a  primary  motivation  for  multicore  processors.  A 
byproduct  of  this  is  that  parallelism  becomes  the  primary  driver  for  perfor¬ 
mance.  Indeed,  multicore  designs  with  up  to  16  cores  are  currently  available, 
and  some  with  up  to  80  cores  have  been  prototyped. 

Multicore  processors  bring  tremendous  raw  performance  to  the  desktop. 
Harnessing  this  raw  compute  power  in  real  programs,  ranging  from  desktop 
applications  to  scientific  codes  is  critical  to  their  commercial  and  technological 
viability.  Beyond  the  desktop,  the  packaging  and  power  characteristics  of  mul¬ 
ticore  processors  make  it  possible  to  integrate  a  large  number  of  such  processors 
into  a  parallel  ensemble.  This  has  the  potential  to  realize  the  long-held  vision  of 
truly  scalable  parallel  platforms.  Indeed,  systems  with  close  to  hundred  thou¬ 
sand  cores  are  likely  to  become  available  in  the  foreseeable  future.  The  peak 
performance  of  such  systems  is  anticipated  to  be  several  petaFLOPS. 

The  ability  to  build  scalable  parallel  systems  with  very  large  number  of 
processing  cores  puts  forth  a  number  of  technical  challenges.  With  the  large 
component  counts,  the  reliability  of  such  systems  is  a  major  issue.  System  soft¬ 
ware  support  must  provide  a  reliable  logical  machine  view  to  the  programmer, 
while  supporting  efficient  program  execution.  Programming  models  and  lan¬ 
guages  must  allow  a  programmer  to  express  concurrency  and  interactions  in  a 
manner  that  maximizes  concurrency,  minimizes  overheads,  while  at  the  same 
time  providing  high-level  abstractions  of  the  underlying  hardware.  Runtime 
systems  must  continually  monitor  faults  and  application  performance,  while 
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providing  diagnostic,  prognostic,  and  remedial  services.  While  all  of  these  are 
important  issues,  in  our  belief,  there  is  an  even  more  fundamental  question  -  are 
there  algorithms  that  can  efficiently  scale  to  such  configurations,  given  hardware 
parameters,  limitations,  and  overheads?  The  answer  to  this  question,  and  the 
methodology  by  which  answers  this  question  are  critical. 

Estimating  the  performance  of  a  program  on  a  larger  hardware  configura¬ 
tion  from  known  performance  on  smaller  configurations  and  smaller  problem 
instances  is  generally  referred  to  as  scalability  analysis.  Indeed,  this  is  precisely 
the  problem  we  are  faced  with,  when  we  try  to  assess  applications  and  algo¬ 
rithms  capable  of  efficiently  utilizing  emerging  large-scale  parallel  platforms. 
Scalability  analysis  has  been  well-studied  in  the  literature  -  with  automated, 
analytical,  and  experimental  methods  proposed  and  investigated.  Each  class  of 
methods  has  its  limitations  and  advantages.  Automated  techniques  [16,  40]  work 
on  realizations  of  algorithms  (programs),  as  opposed  to  algorithms  themselves. 
From  this  point  of  view,  their  role  in  guiding  algorithm  development  is  limited  - 
since  they  are  implementation  based.  Furthermore,  automated  techniques  gen¬ 
erally  rely  on  parameterizing  quanta  of  communication  and  computation,  and 
using  or  simulations  (Monte  Carlo,  discrete  event  simulation).  For  this  reason, 
these  methods  have  limited  accuracy,  or  may  themselves  be  computationally 
expensive. 

Experimental  approaches  [3,  13,  27,  28,  79]  to  scalability  analysis  typically 
perform  detailed  quantification  of  the  program  instance  and  underlying  plat¬ 
form,  in  order  to  generate  a  prediction  model.  These  models  can  be  used  to 
predict  performance  beyond  the  observed  envelope.  As  is  the  case  with  au¬ 
tomated  techniques,  since  these  methods  use  realizations  of  algorithms,  their 
ability  to  guide  the  algorithm  development  process  is  limited.  Furthermore, 
because  of  the  sensitive  interaction  between  programs  and  underlying  systems, 
the  prediction  envelope  of  such  methods  is  often  limited. 

Analytical  methods  for  scalability  analysis  rely  on  asymptotic  estimates  of 
basic  metrics,  such  as  parallel  time  and  efficiency.  Where  available,  such  an¬ 
alytical  estimates  hold  the  potential  for  accurate  scalability  predictions,  while 
providing  guidance  for  algorithm  and  architecture  development.  For  example, 
scalability  analysis  predicts  the  necessary  network  bandwidth/compute  power 
in  a  system  to  efficiently  scale  Fast  Fourier  Transforms  (FFTs)  to  large  num¬ 
bers  of  processors.  The  major  drawback  of  these  methods,  of  course,  is  that 
analytical  quantification  of  parallel  performance  is  not  always  easy.  Even  for 
simple  algorithms,  the  presence  of  system  (as  opposed  to  programmer)  con¬ 
trolled  performance  optimizations  (caches  and  associated  replacement  policies 
are  the  simple  examples)  make  accurate  analytical  quantification  difficult. 

In  this  chapter,  we  focus  on  analytical  approaches  to  scalability  analysis.  Our 
choice  is  motivated  by  the  fact  that  in  relatively  early  stages  of  hardware  devel¬ 
opment,  such  studies  can  effectively  guide  hardware  design.  Furthermore,  we 
believe  that  for  many  applications,  parallel  algorithms  scaling  to  very  large  con¬ 
figurations  may  not  be  the  same  as  those  that  yield  high  efficiencies  on  smaller 
configurations.  For  these  desired  predictive  properties,  analytical  modeling  is 
critical. 
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There  are  a  number  of  scaling  scenarios,  which  may  be  suited  to  various  ap¬ 
plication  scenarios.  In  a  large  class  of  applications,  the  available  memory  limits 
the  size  of  problem  instance  being  solved.  The  question  here  is,  given  constraints 
on  memory  size,  how  does  an  algorithm  perform?  In  other  applications,  for  ex¬ 
ample,  weather  forecasting,  there  are  definite  deadlines  on  task  completion.  The 
question  here  becomes  one  of,  what  is  the  largest  problem  one  can  solve,  given 
constraints  on  execution  time,  on  a  specific  machine  configuration.  In  yet  other 
applications,  the  emphasis  may  be  on  efficiency.  Here,  one  asks  the  question, 
what  is  the  smallest  problem  instance  I  must  solve  on  a  machine  so  as  to  achieve 
desired  execution  efficiency.  Each  of  these  scenarios  is  quantified  by  a  different 
scalability  metric.  We  describe  these  metrics  and  their  application  to  specific 
algorithms. 

The  rest  of  this  chapter  is  organized  as  follows:  in  Section  5.2,  we  describe 
basic  metrics  of  parallel  performance,  in  Section  5.3,  we  describe  various  scala¬ 
bility  metrics  and  how  they  can  be  applied  to  emerging  parallel  platforms.  In 
Section  5.4,  we  discuss  alternate  parallel  paradigms  of  task  mapping  for  com¬ 
posite  algorithms  and  how  our  scalability  analysis  applies  to  these  scenarios. 

5.2  Metrics  of  Parallel  Performance 

Throughout  this  chapter,  we  refer  to  a  single  processing  core  as  a  processor. 
Where  we  talk  of  a  chip  with  multiple  cores,  we  refer  to  it  as  a  chip  multipro¬ 
cessor,  or  a  multicore  processor.  We  assume  that  a  parallel  processor  (or  ensem¬ 
ble)  consists  of  p  identical  processing  cores.  The  cores  are  connected  through 
an  interconnect.  Exchanging  a  message  of  size  m  between  cores  incurs  a  time 
of  ts  +  twm.  Here,  ts  corresponds  to  the  network  latency  and  tw  the  per-word 
transfer  time  determined  by  the  network  bandwidth.  The  communication  model 
used  here  is  a  simplification,  since  it  does  not  account  for  network  congestion, 
communication  patterns,  overlapped  access,  and  interconnect  topology,  among 
others.  We  define  a  parallel  system  as  a  combination  of  a  parallel  program  and 
a  parallel  machine.  We  assume  that  the  serial  time  of  execution  of  an  algorithm 
Ts  is  composed  of  W  basic  operations. 

Perhaps  the  simplest  and  most  intuitive  metric  of  parallel  performance  is 
the  parallel  runtime,  Tp.  This  is  the  time  elapsed  between  initiation  of  the 
parallel  program  at  the  first  processor  to  the  completion  of  the  program  at  the 
last  processor.  Indeed,  an  analytical  expression  for  parallel  time  captures  all 
the  performance  parameters  of  a  parallel  algorithm.  The  problem  with  parallel 
runtime  is  that  it  does  not  account  for  the  resources  used  to  achieve  the  execution 
time.  Specifically,  if  one  were  to  indicate  that  the  parallel  runtime  of  a  program, 
which  took  10s  on  a  serial  processor,  is  2  seconds,  we  would  have  no  way  of 
knowing  whether  the  parallel  program  (and  associated  algorithm)  performs  well 
or  not.  The  obvious  question  w.r.t.  parallel  time  is,  how  much  lower  it  is 
compared  to  its  serial  counterpart.  This  speedup,  S  is  defined  as  the  ratio  of 
the  serial  time  Ts  and  the  parallel  time  Tp.  Mathematically,  S  =  Ts/Tp.  For 
the  example  above,  the  speedup  is  10/2  =  5. 
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While  we  have  a  sense  of  how  much  faster  our  program  is  compared  to  its 
serial  counterpart,  we  still  cannot  estimate  its  performance,  since  we  do  not 
know  how  many  resources  were  consumed  to  achieve  this  speedup.  Specifically, 
if  one  were  told  that  a  parallel  program  achieved  a  speedup  of  5,  could  we  say 
anything  about  the  performance  of  the  program?  For  this  reason,  the  speedup 
can  be  normalized  with  respect  to  the  number  of  processors  p  to  compute  ef¬ 
ficiency.  Formally,  efficiency  E  =  S/p.  In  the  example  above,  if  the  speedup 
of  5  is  achieved  using  8  processors,  the  efficiency  is  5/8  or  0.625  (or  62.5%).  If 
the  same  speedup  is  achieved  using  16  processors,  the  corresponding  efficiency 
is  0.312  (or  31.2%).  Clearly,  the  algorithm  can  be  said  to  perform  better  in  the 
former  case. 

Efficiency,  by  itself,  is  not  a  complete  indicator  of  parallel  performance  ei¬ 
ther.  A  given  problem  can  be  solved  using  several  possible  serial  algorithms, 
which  may  be  more  or  less  amenable  to  parallel  execution.  For  example,  given 
a  sparse  linear  system,  we  can  solve  it  using  simple  Jacobi  iterations.  These 
are  easily  parallelizable  and  mostly  involve  near-neighbor  communication  (other 
than  global  dot  products).  On  the  other  hand,  the  same  system  can  be  solved  us¬ 
ing  a  more  powerful  solver  such  as  the  Generalized  Minimum  Residual  method, 
GMRES,  see  Yousef  Saad,  Iterative  Methods  for  Sparse  Linear  Systems,  second 
edition,  SIAM,  2003,  with  an  approximate  inverse  preconditioner.  This  method 
solves  the  problem  much  faster  in  the  serial  context,  but  is  less  amenable  to  par¬ 
allelism,  compared  to  the  first  algorithm.  In  such  cases,  one  must  be  cautious 
relying  simply  on  efficiency  as  a  metric  for  performance. 

To  account  for  these  issues,  vendors  and  users  sometimes  rely  on  total  aggre¬ 
gate  FLOPS  as  a  metric.  Indeed,  when  we  refer  to  a  platform  as  being  capable 
of  ten  petaFLOPS,  it  is  in  the  context  of  some  program  and  input,  or  the 
absolute  peak  performance,  which  may,  or  may  not  be  achievable  by  any  pro¬ 
gram.  The  most  popular  choice  of  a  benchmark  in  this  context  is  the  Linpack 
benchmark.  The  Linpack  benchmark  has  favorable  data  reuse  and  computa¬ 
tion/communication  characteristics,  and  therefore  yields  parallel  performance 
closer  to  peak  for  typical  parallel  platforms.  However,  this  begs  the  obvious 
question  of  what,  if  any,  this  implies  for  other  applications  that  may  not  have 
the  same  data  reuse  characteristics.  Indeed,  Linpack  numbers,  referring  to  dense 
matrix  operations  are  rarely  indicative  of  machine  performance  on  typical  appli¬ 
cations  that  have  sparse  kernels  (PDE  solvers),  molecular  dynamics  type  sparse 
interaction  potentials,  access  workloads  of  large-scale  data  analysis  kernels,  or 
commercial  server  workloads.  It  is  important  to  note  that  the  oft-used  sparse 
kernels  often  yield  as  low  as  5%  to  10%  of  peak  performance  on  conventional 
processors  because  of  limited  data  reuse. 

In  addition  to  the  shortcomings  mentioned  above,  basic  metrics  do  not  ex¬ 
plicitly  target  scaling.  Specifically,  if  the  parallel  time  of  program  (algorithm) 
A  for  solving  a  problem  is  lower  than  that  of  algorithm  B  for  solving  the  same 
problem,  what  does  it  say  if  the  problem  size  is  changed,  the  number  of  pro¬ 
cessors  is  increased,  or  if  any  of  the  computation/  communication  parameters 
are  varied.  It  turns  out  that  a  single  sample  point  in  the  multidimensional 
performance  space  of  parallel  programs  says  nothing  about  the  performance  at 
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other  points.  This  provides  strong  motivation  for  the  development  of  scalability 
metrics. 


5.3  Metrics  for  Scalability  Analysis 

We  motivate  scalability  analysis  through  two  examples,  at  two  diverse  ends  of 
the  spectrum.  The  first  example  relates  to  emerging  large-scale  platforms,  and 
the  second,  to  emerging  scalable  multicore  desktop  processors. 

An  important  challenge  to  the  parallel  computing  community  currently  is 
to  develop  a  core  set  of  algorithms  and  applications  that  can  utilize  emerging 
petascale  computers.  The  number  of  cores  in  these  platforms  will  be  in  the 
range  of  100,000,  in  an  energy  efficient  configuration.  A  number  of  important 
questions  are  posed  in  this  context  -  (i)  which,  if  any,  of  the  current  applications 
and  algorithms  are  likely  to  be  able  to  scale  to  such  configurations,  (ii)  what 
fundamentally  new  algorithms  will  be  required  to  scale  to  very  large  machine 
configurations,  (iii)  what  are  realizable  architectural  features  that  will  enhance 
scaling  characteristics  of  wide  classes  of  algorithms,  (iv)  what  are  the  memory 
and  I/O  requirements  of  such  platforms,  and  (v)  how  can  time-critical  applica¬ 
tions,  such  as  weather  forecasting,  effectively  utilize  these  platforms.  Scalability 
analysis  can  answer  many  of  these  questions,  while  guiding  algorithm  and  ar¬ 
chitecture  design. 

At  the  other  end  of  the  spectrum,  desktop  software  vendors  are  grappling 
with  questions  of  how  to  develop  efficient  software  with  meaningful  develop¬ 
ment  and  deployment  cycles.  Specifically,  will  software,  ranging  from  operating 
systems  (Windows  Vista,  Linux,  etc.)  to  desktop  applications  (word  process¬ 
ing,  media  applications)  be  able  to  use  64  cores  or  beyond,  likely  to  become 
available  in  high-end  desktops  and  engineering  workstations  in  a  5-year  time- 
frame.  The  difficulty  associated  with  this  is  that  such  platforms  do  not  exist 
currently,  therefore,  one  must  design  algorithms  and  software  that  can  scale  to 
future  platforms.  Again,  scalability  analysis  can  guide  algorithm  and  software 
development  in  fundamental  ways. 

The  general  theme  of  these  two  motivating  examples  is  that  often,  programs 
are  designed  and  tested  for  smaller  problems  on  fewer  processing  elements.  How¬ 
ever,  the  real  problems  these  programs  are  intended  to  solve  are  much  larger, 
and  the  machines  contain  larger  number  of  processing  elements.  Whereas  code 
development  is  simplified  by  using  scaled-down  versions  of  the  machine  and  the 
problem,  their  performance  and  correctness  is  much  more  difficult  to  establish 
based  on  scaled-down  systems. 

Why  is  performance  extrapolation  so  difficult?  Consider  three  parallel 
algorithms  for  computing  an  n-point  Fast  Fourier  Transform  (FFT)  on  64  pro¬ 
cessing  cores.  Figure  5.1  illustrates  speedup  as  the  value  of  n  is  increased  to 
18K.  Keeping  the  number  of  processing  cores  constant,  at  smaller  values  of  n, 
one  would  infer  from  observed  speedups  that  binary  exchange  and  3-D  transpose 
algorithms  are  the  best.  However,  as  the  problem  is  scaled  up  to  18  K  points  and 
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n  - - 

Figure  5.1:  A  comparison  of  the  speedups  obtained  by  the  binary-exchange,  2-D 
transpose  and  3-D  transpose  algorithms  on  6f  cores  with  tc  =  2 ns,  tw  =  4ns, 
ts  =  25 ns,  and  th  =  2 ns. 


beyond,  it  is  evident  from  Figure  5.1  that  the  2-D  transpose  algorithm  yields 
best  speedup  [39]. 

Similar  results  can  be  shown  relating  to  the  variation  in  number  of  processing 
cores  as  the  problem  size  is  held  constant.  Unfortunately,  such  parallel  perfor¬ 
mance  traces  are  the  norm  as  opposed  to  the  exception,  making  performance 
prediction  based  on  limited  observed  data  very  difficult. 

5.3.1  Scaling  characteristics  of  parallel  programs 

The  parallel  runtime  of  a  program,  summed  across  all  processing  cores,  ( pTp ), 
consists  of  essential  computation  which  would  be  performed  by  its  serial  counter¬ 
part  ( Ts ),  and  overhead  incurred  in  parallelization  ( Ta ).  Note  that  we  use  Ta  to 
denote  the  total  (or  parallel)  overhead  summed  across  all  processors.  Formally, 
we  write  this  as: 

pTp  =  Ts  +  Ta  (5.1) 

We  also  know  that  the  efficiency  of  a  parallel  program  can  be  written  as: 

F=S  =  1±_ 

P  pTp 

Using  the  expression  for  parallel  overhead  (Equation  5.1),  we  can  rewrite  this 
expression  as 

E=- (5-2) 

1  +  Ts 
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The  total  overhead  function  Ta  is  an  increasing  function  of  p  [31,  32].  This 
is  because  every  program  must  contain  some  serial  component.  If  this  serial 
component  of  the  program  takes  time  tseriai ,  then  during  this  time  all  the  other 
processing  elements  must  be  idle.  This  corresponds  to  a  total  overhead  function 
of  (p  —  1)  x  tseriai ■  Therefore,  the  total  overhead  function  T0  grows  at  least 
linearly  with  p.  Furthermore,  due  to  communication,  idling,  and  excess  compu¬ 
tation,  this  function  may  grow  super-linear ly  in  the  number  of  processing  cores. 
Equation  5.2  gives  us  several  interesting  insights  into  the  scaling  of  parallel  pro¬ 
grams.  First,  for  a  given  problem  size  (i.e.,  the  value  of  Ts  remains  constant), 
as  we  increase  the  number  of  processing  elements,  Ta  increases.  In  this  scenario, 
it  is  clear  from  Equation  5.2  that  the  overall  efficiency  of  the  parallel  program 
goes  down.  This  characteristic  of  decreasing  efficiency  with  increasing  number 
of  processing  cores  for  a  given  problem  size  is  common  to  all  parallel  programs. 
Often,  in  our  quest  for  ever  more  powerful  machines,  this  fundamental  insight 
is  lost  -  namely  that  the  same  problem  instances  that  we  solve  on  todays  com¬ 
puters  are  unlikely  to  yield  meaningful  performance  on  emerging  highly-parallel 
platforms. 

Let  us  investigate  the  effect  of  increasing  the  problem  size  keeping  the  num¬ 
ber  of  processing  cores  constant.  We  know  that  the  total  overhead  function  Ta  is 
a  function  of  both  problem  size  Ts  and  the  number  of  processing  elements  p.  In 
many  cases,  Ta  grows  sub-linearly  with  respect  to  Ts.  In  such  cases,  we  can  see 
that  efficiency  increases  if  the  problem  size  is  increased  keeping  the  number  of 
processing  elements  constant.  Indeed,  when  demonstrating  parallel  performance 
on  large  machine  configurations,  many  researchers  assume  that  the  problem  is 
scaled  in  proportion  to  the  number  of  processors.  This  notion  of  experimentally 
scaling  problem  size  to  maintain  constant  computation  per  processor  is  referred 
to  as  scaled  speedup  [41,  42,  43],  and  is  discussed  in  Section  5.3.7.  For  such 
algorithms,  it  should  be  possible  to  keep  the  efficiency  fixed  by  increasing  both 
the  size  of  the  problem  and  the  number  of  processing  elements  simultaneously. 
This  ability  to  maintain  efficiency  constant  by  simultaneously  increasing  the 
number  of  processing  cores  and  the  size  of  the  problem  is  essential  to  utilizing 
scalable  parallel  platforms.  Indeed,  a  number  of  parallel  systems  exhibit  such 
characteristics.  We  call  such  systems  scalable  parallel  systems.  The  scalability  of 
a  parallel  system  is  a  measure  of  its  capacity  to  increase  performance  (speedup) 
in  proportion  to  the  number  of  processing  cores.  It  reflects  a  parallel  system’s 
ability  to  utilize  increasing  processing  resources  effectively. 

5.3.2  The  isoefiiciency  metric  of  scalability 

We  summarize  our  discussion  in  the  section  above  in  the  following  two  observa¬ 
tions: 

1.  For  a  given  problem  size,  as  we  increase  the  number  of  processing  cores, 
the  overall  efficiency  of  the  parallel  system  goes  down.  This  phenomenon 
is  common  to  all  parallel  systems. 

2.  In  many  cases,  the  efficiency  of  a  parallel  system  increases  if  the  problem 
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Figure  5.2:  Variation  of  efficiency:  (a)  as  the  number  of  processing  elements  is 
in  creased  for  a  given  problem  size;  and  (b)  as  the  problem  size  is  increased  for 
a  given  number  of  processing  elements.  The  phenomenon  illustrated  in  graph 
(b)  is  not  common  to  all  parallel  systems. 


size  is  increased  while  keeping  the  number  of  processing  cores  constant. 

These  two  phenomena  are  illustrated  in  Figure  5.2(a)  and  (b),  respectively. 
Following  from  these  two  observations,  we  define  a  scalable  parallel  system  as 
one  in  which  the  efficiency  can  be  kept  constant  as  the  number  of  processing 
elements  is  increased,  provided  that  the  problem  size  is  also  increased. 

An  important  question  that  follows  naturally  is,  what  is  the  rate  at  which 
the  problem  size  must  be  increased  in  order  to  keep  efficiency  constant.  This 
rate  is  critical  for  a  number  of  reasons  -  primarily,  this  rate  determines  the  rate 
at  which  the  total  memory  in  the  system  must  be  increased.  If  problem  size  is 
linear  in  memory  size,  as  is  the  case  for  a  number  of  algorithms  (applications), 
and  the  rate  of  increase  in  problem  size  is  super-linear,  this  implies  that  the  total 
memory  in  the  parallel  machine  must  increase  super-linearly.  This  is  a  critical 
observation.  Conversely,  if  the  increase  in  system  memory  is  sublinear  with 
respect  to  the  number  of  cores,  one  can  trivially  show  that  the  parallel  efficiency 
will  decline.  The  relation  between  problem  size,  memory  size,  and  efficiency,  is  a 
critical  determinant  of  the  scaling  characteristics  of  parallel  systems.  A  second 
implication  of  increasing  problem  size  is  that  a  super-linear  increase  results  in  an 
increasing  (parallel)  time  to  solution.  For  applications  with  constraints  on  time 
to  solution  (a  classic  example  is  in  short-term  weather  forecasting),  this  may 
eventually  limit  growth  in  problem  size.  We  discuss  these  three  issues,  namely, 
increase  in  problem  size  to  maintain  efficiency,  impact  of  memory  constraints, 
and  impact  of  parallel  solution  time  constraints  separately. 

For  different  parallel  systems,  the  problem  size  must  increase  at  different 
rates  in  order  to  maintain  a  fixed  efficiency  as  the  number  of  processing  elements 
is  increased.  This  rate  is  determined  by  the  overheads  incurred  in  parallelization. 
A  lower  rate  is  more  desirable  than  a  higher  growth  rate  in  problem  size.  This 
is  because  it  allows  us  to  extract  good  performance  on  smaller  problems,  and 
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consequently,  improved  performance  on  even  larger  problem  instances.  We  now 
investigate  metrics  for  quantitatively  determining  the  degree  of  scalability  of  a 
parallel  system.  We  start  by  formally  defining  the  Problem  Size. 

Problem  Size  When  analyzing  parallel  systems,  we  frequently  encounter  the 
notion  of  the  size  of  the  problem  being  solved.  Thus  far,  we  have  used  the  term 
problem  size  informally,  without  giving  a  precise  definition.  A  naive  way  to 
express  problem  size  is  as  a  parameter  of  the  input  size;  for  instance,  n  in  case 
of  a  matrix  operation  involving  n  x  n  matrices.  A  drawback  of  this  definition 
is  that  the  interpretation  of  problem  size  changes  from  one  problem  to  another. 
For  example,  doubling  the  input  size  results  in  an  eight-fold  increase  in  the 
execution  time  for  matrix  multiplication  and  a  four-fold  increase  for  matrix 
addition  (assuming  that  the  conventional  Q(n3)  algorithm  is  the  best  matrix 
multiplication  algorithm,  and  disregarding  more  complicated  algorithms  with 
better  asymptotic  complexities).  Using  memory  size  as  a  measure  of  problem 
size  has  a  similar  drawback.  In  this  case,  the  memory  size  associated  with  the 
dense  matrices  is  @(n2),  which  is  also  the  time  for,  say,  matrix  addition  and 
matrix- vector  multiplication,  but  not  for  matrix-matrix  multiplication. 

A  consistent  definition  of  the  size  or  the  magnitude  of  the  problem  should  be 
such  that,  regardless  of  the  problem,  doubling  the  problem  size  always  means 
performing  twice  the  amount  of  computation.  Therefore,  we  choose  to  express 
problem  size  in  terms  of  the  total  number  of  basic  operations  required  to  solve 
the  problem.  By  this  definition,  the  problem  size  is  Q(n3)  for  n  x  n  matrix 
multiplication  (assuming  the  conventional  algorithm)  and  0(n2)  for  nxn  matrix 
addition.  In  order  to  keep  it  unique  for  a  given  problem,  we  define  problem  size 
as  the  number  of  basic  computation  steps  in  the  best  sequential  algorithm  to 
solve  the  problem  on  a  single  processing  core.  Since  it  is  defined  in  terms  of 
sequential  time  complexity,  problem  size  W  is  a  function  of  the  size  of  the  input. 

Without  loss  of  generality,  we  often  assume  that  it  takes  unit  time  to  perform 
one  basic  computation  step  of  an  algorithm.  This  assumption  does  not  impact 
the  analysis  of  any  parallel  system  because  the  other  hardware-related  constants, 
such  as  message  startup  time,  per-word  transfer  time,  and  per-hop  time,  can  be 
normalized  with  respect  to  the  time  taken  by  a  basic  computation  step.  With 
this  assumption,  the  problem  size  W  is  equal  to  the  serial  runtime  Ts  of  the 
fastest  known  algorithm  to  solve  the  problem  on  a  sequential  computer. 

The  isoefficiency  function 

Parallel  execution  time  can  be  expressed  as  a  function  of  problem  size,  overhead 
function,  and  the  number  of  processing  elements.  We  can  write  parallel  runtime 
as: 


TP 


W  +  T0(W,p) 
P 


(5.3) 
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The  resulting  expression  for  speedup  is 


S 


W 

% 

Wp 

W  +  T0{WlP)' 


Finally,  we  write  the  expression  for  efficiency  as 


P 

W 

W  +  Ta{W,p) 

1 

1  +  T0{W,p)/W 


(5.4) 


(5.5) 


In  Equation  5.5,  if  the  problem  size  is  kept  constant  and  p  is  increased,  the 
efficiency  decreases  because  the  total  overhead  T0  increases  with  p.  If  W  is 
increased  keeping  the  number  of  processing  elements  fixed,  then  for  scalable 
parallel  systems,  the  efficiency  increases.  This  is  because  Ta  grows  slower  than 
0(W)  for  a  fixed  p.  For  these  parallel  systems,  efficiency  can  be  maintained  at 
a  desired  value  (between  0  and  1)  for  increasing  p,  provided  W  is  also  increased. 

For  different  parallel  systems,  W  must  be  increased  at  different  rates  with 
respect  to  p  in  order  to  maintain  a  fixed  efficiency.  For  instance,  in  some  cases, 
W  might  need  to  grow  as  an  exponential  function  of  p  to  keep  the  efficiency 
from  dropping  as  p  increases.  Such  parallel  systems  are  poorly  scalable.  The 
reason  is  that  on  these  parallel  systems  it  is  difficult  to  obtain  good  speedups 
for  a  large  number  of  processing  elements  unless  the  problem  size  is  enormous. 
On  the  other  hand,  if  W  needs  to  grow  only  linearly  with  respect  to  p,  then  the 
parallel  system  is  highly  scalable.  That  is  because  it  can  easily  deliver  speedups 
proportional  to  the  number  of  processing  elements  for  reasonable  problem  sizes. 

For  scalable  parallel  systems,  efficiency  can  be  maintained  at  a  fixed  value 
(between  0  and  1)  if  the  ratio  Ta/W  in  Equation  5.5  is  maintained  at  a  constant 
value.  For  a  desired  value  E  of  efficiency, 

1 

l  +  T0(W,p)/W' 

1  -E 
E  ’ 

YTeTo(W,p).  (5.6) 


E  = 

T0(W,p) 

W 

w  = 


Let  K  =  E/(l  —  E)  be  a  constant  depending  on  the  efficiency  to  be  maintained. 
Since  Ta  is  a  function  of  W  and  p,  Equation  5.6  can  be  rewritten  as 


W  =  KT0(W,p). 


(5.7) 
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From  Equation  5.7,  the  problem  size  W  can  usually  be  obtained  as  a  func¬ 
tion  of  p  by  algebraic  manipulations.  This  function  dictates  the  growth  rate  of 
W  required  to  keep  the  efficiency  fixed  as  p  increases.  We  call  this  function  the 
isoefficiency  function  of  the  parallel  system.  The  isoefficiency  function  deter¬ 
mines  the  ease  with  which  a  parallel  system  can  maintain  constant  efficiency, 
and  hence  achieve  speedups  increasing  in  proportion  to  the  number  of  process¬ 
ing  elements.  A  small  isoefficiency  function  means  that  small  increments  in  the 
problem  size  are  sufficient  for  the  efficient  utilization  of  an  increasing  number 
of  processing  elements,  indicating  that  the  parallel  system  is  highly  scalable. 
However,  a  large  isoefficiency  function  indicates  a  poorly  scalable  parallel  sys¬ 
tem.  The  isoefficiency  function  does  not  exist  for  unscalable  parallel  systems, 
because  in  such  systems  the  efficiency  cannot  be  kept  at  any  constant  value  as 
p  increases,  no  matter  how  fast  the  problem  size  is  increased. 

Isoefficiency  function  of  computing  an  n-point  FFT.  One  parallel  for¬ 
mulation  of  FFT,  called  the  Binary  Exchange  Algorithm,  computes  an  ?r-point 
FFT  on  a  p  processor  machine  (with  0(p),  or  full  bisection  bandwidth)  in  time: 

Tl  Tl 

Tp  =  tc  —  log  n  +  ts  log  p  +  t.w  —  log  p 

p  p 

Recall  here  that  tc  refers  to  unit  computation,  ts  to  the  message  startup  time, 
and  tw  to  the  per-word  message  transfer  time.  Recall  also  that  the  problem 
size,  W,  for  an  n-point  FFT  is  given  by: 

W  =  ?rlogn 

The  efficiency,  E,  of  this  computation  can  be  written  as: 

E  _  Ts_  _  _ tcn  log  n _ 

pTp  tc  j  log  n  +  ts  log  p  +  twj  log  p 

Through  simple  algebraic  manipulations,  we  can  rewrite  the  above  expression 
as: 

tsplogp  twlogp  _  1  -  E 
tcn  log  n  tc  log  n  E 

If  efficiency  E  is  held  constant,  the  right  hand  side  is  constant.  The  equality 
can  be  forced  in  the  asymptotic  sense  by  forcing  each  of  the  terms  in  the  left 
hand  side  to  be  constant,  individually.  For  the  first  term,  we  have, 

tsplogp  _ 
tcn  log  n  K ' 

for  constant  K .  This  implies  that  the  corresponding  isoefffciency  term  is  given 
by  IT  «  nlogn  «  plogp.  For  the  second  term,  we  have, 

twlogp  _  J_ 
tc  log  n  I\  1 
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logn  =  K -A  log p, 
tc 

n  =  pKt™/tc, 

W  =  nlogn  =  K—pKtw/tc. 
tC 

In  addition  to  these  two  isoefficiency  terms  resulting  from  communication 
overhead,  since  the  Binary  Exchange  algorithm  can  only  use  p  =  n  processors, 
there  is  an  isoefficiency  term  resulting  from  concurrency.  This  term  is  given  by 
W  =  n  log  n  =  p  log  p. 

The  combined  isoefficiency  from  the  three  terms  is  the  dominant  of  the  three. 
If  Ktw  <  tc  (which  happens  when  we  have  very  high  bandwidth  across  the 
processing  cores),  the  concurrency  and  startup  terms  dominate,  and  we  have  an 
isoefficiency  of  p\ogp.  Otherwise,  the  isoefficiency  is  exponential,  and  given  by 
W  w  pKt^/tc_  This  observation  demonstrates  the  power  of  isoefficiency  analysis 
in  quantifying  both  scalability  of  the  algorithm,  as  well  as  desirable  features 
of  the  underlying  platform.  Specifically,  if  we  would  like  to  use  the  Binary 
Exchange  algorithm  for  FFT  on  large  platforms,  the  balance  of  the  machine 
should  be  such  that  Ktw  <  tc  (i.e.,  it  must  have  sufficient  per-processor  bisection 
bandwidth) . 

The  isoefficiency  function  [38]  captures,  in  a  single  expression,  the  charac¬ 
teristics  of  a  parallel  algorithm  as  well  as  the  parallel  platform  on  which  it  is 
implemented.  Isoefficiency  analysis  enables  us  to  predict  performance  on  larger 
number  of  processing  cores  from  observed  performance  on  a  smaller  number  of 
cores.  Detailed  isoefficiency  analysis  can  also  be  used  to  study  the  behavior  of 
a  parallel  system  with  respect  to  changes  in  hardware  parameters  such  as  the 
speed  of  processing  cores  and  communication  channels.  In  many  cases,  isoef- 
ficiency  analysis  can  be  used  even  for  parallel  algorithms  for  which  we  cannot 
derive  a  value  of  parallel  runtime,  by  directly  relating  the  total  overhead  to  the 
problem  size  and  number  of  processors  [39] . 

5.3.3  Cost-optimality  and  the  isoefficiency  function 

A  parallel  system  is  cost-optimal  if  the  product  of  the  number  of  processing 
cores  and  the  parallel  execution  time  is  proportional  to  the  execution  time  of 
the  fastest  known  sequential  algorithm  on  a  single  processing  element.  In  other 
words,  a  parallel  system  is  cost-optimal  if  and  only  if 

pTp  =  0(ffi).  (5.8) 

Substituting  the  expression  for  Tp  from  the  right-hand  side  of  Equation  5.3,  we 
get  the  following: 


W  +  T0{W,p)  = 
T0(W,p )  = 
W  = 
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0(W) 

0{W) 

O  (To(W,p)) 


(5.9) 

(5.10) 


Equations  5.9  and  5.10  suggest  that  a  parallel  system  is  cost-optimal  if  and 
only  if  its  overhead  function  does  not  asymptotically  exceed  the  problem  size. 
This  is  very  similar  to  the  condition  specified  by  Equation  5.7  for  maintaining  a 
fixed  efficiency  while  increasing  the  number  of  processing  elements  in  a  parallel 
system.  If  Equation  5.7  yields  an  isoefficiency  function  f(p),  then  it  follows 
from  Equation  5.10  that  the  relation  W  =  must  be  satisfied  to  ensure 

the  cost-optimality  of  a  parallel  system  as  it  is  scaled  up. 

5.3.4  A  lower  bound  on  the  isoefficiency  function 

We  discussed  earlier  that  a  smaller  isoefficiency  function  indicates  higher  seal- 
ability.  Accordingly,  an  ideally-scalable  parallel  system  must  have  the  lowest 
possible  isoefficiency  function.  For  a  problem  consisting  of  W  units  of  work, 
no  more  than  W  processing  elements  can  be  used  cost-optimally;  additional 
processing  cores  will  be  idle.  If  the  problem  size  grows  at  a  rate  slower  than 
0(p)  as  the  number  of  processing  elements  increases,  the  number  of  process¬ 
ing  cores  will  eventually  exceed  W.  Even  for  an  ideal  parallel  system  with  no 
communication,  or  other  overhead,  the  efficiency  will  drop  because  processing 
elements  added  beyond  p  =  W  will  be  idle.  Thus,  asymptotically,  the  problem 
size  must  increase  at  least  as  fast  as  0(p)  to  maintain  fixed  efficiency;  hence, 
O  (p)  is  the  asymptotic  lower  bound  on  the  isoefficiency  function.  It  follows  that 
the  isoefficiency  function  of  an  ideally  scalable  parallel  system  is  0(p). 

This  trivial  lower  bound  implies  that  the  problem  size  must  grow  at-least  lin¬ 
early,  or,  consequently,  the  problem  size  per  processor  must  be  at-least  constant. 
For  problems  where  memory  size  and  problem  size,  W  are  linearly  related,  the 
memory  per  processing  core  must  increase  linearly  to  maintain  constant  effi¬ 
ciency. 

5.3.5  The  degree  of  Concurrency  and  the  Isoefficiency 
Function 

A  lower  bound  of  fi(p)  is  imposed  on  the  isoefficiency  function  of  a  parallel 
system  by  the  number  of  operations  that  can  be  performed  concurrently.  The 
maximum  number  of  tasks  that  can  be  executed  simultaneously  at  any  time  in  a 
parallel  algorithm  is  called  its  degree  of  concurrency.  The  degree  of  concurrency 
is  a  measure  of  the  number  of  operations  that  an  algorithm  can  perform  in 
parallel  for  a  problem  of  size  W;  it  is  independent  of  the  parallel  architecture. 
If  C{W)  is  the  degree  of  concurrency  of  a  parallel  algorithm,  then  for  a  problem 
of  size  W,  no  more  than  C(W)  processing  elements  can  be  employed  effectively. 

Effect  of  concurrency  on  isoefficiency  function.  Consider  solving  a  sys¬ 
tem  of  n  equations  in  n  variables  by  using  Gaussian  elimination.  The  total 
amount  of  computation  is  @(n3).  But  the  n  variables  must  be  eliminated  one 
after  the  other  (assuming  pivoting),  and  eliminating  each  variable  requires  0(?i2) 
computations.  Thus,  at  most  0(n2)  processing  elements  can  be  kept  busy  at 
any  time.  Since  W  =  0(rc3)  for  this  problem,  the  degree  of  concurrency  C(W) 
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is  ©(W2/3)  and  at  most  ©(IF2/3)  processing  elements  can  be  used  efficiently. 
On  the  other  hand,  given  p  processing  cores,  the  problem  size  should  be  at  least 
0(p3/2)  to  use  them  all.  Thus,  the  isoefficiency  function  of  this  computation 
due  to  concurrency  is  0(p3/2). 

The  isoefficiency  function  due  to  concurrency  is  optimal  (that  is,  0 (p))  only 
if  the  degree  of  concurrency  of  the  parallel  algorithm  is  ©(IF).  If  the  degree  of 
concurrency  of  an  algorithm  is  less  than  ©(IF),  then  the  isoefficiency  function 
due  to  concurrency  is  worse  (that  is,  greater)  than  0(p).  In  such  cases,  the 
overall  isoefficiency  function  of  a  parallel  system  is  given  by  the  maximum  of  the 
isoefficiency  functions  due  to  concurrency,  communication,  and  other  overheads. 

5.3.6  Scaling  properties  and  parallel  benchmarks 

At  this  point,  we  can  further  investigate  why  dense  linear  algebra  kernels  are 
frequently  used  as  benchmarks  for  parallel  systems.  Consider  the  example  of 
matrix  multiplication  -  the  first  thing  to  note  is  that  it  has  significant  data  reuse, 
even  in  the  serial  sense  (0(n3)  operations  on  0(n2)  data).  The  parallel  runtime 
of  a  commonly  used  matrix  multiplication  algorithm  (Cannon’s  algorithm)  is 
given  by: 

n3  n 2 

Tp  = - b  2 ^fpts  +  2tw—p 

P  Vp 

The  corresponding  isoefficiency  is  given  by  IF  =  p15.  The  favorable  memory  to 
FLOPS  ratio  of  matrix  multiplication,  combined  with  the  relatively  favorable 
isoefficiency  function  makes  this  an  ideal  parallel  benchmark.  Specifically,  if 
M  ss  n2  represents  the  memory  requirement,  since  IF  ss  n3,  we  have  IF  ss  M1'5. 
Substituting  in  the  expression  for  isoefficiency,  we  have,  IF  ss  M1'5  «  p1-5. 
From  this,  we  can  see  that  we  can  keep  efficiency  constant  with  M  w  p.  In 
other  words,  the  increase  in  problem  size  to  hold  efficiency  constant  requires 
only  a  linear  increase  in  total  memory,  or  constant  memory  per  processor. 

The  relatively  benign  memory  requirement  and  their  high  processor  utiliza¬ 
tion  makes  dense  kernels,  such  as  matrix-matrix  multiplication  and  linear  direct 
solvers  ideal  benchmarks  for  large-scale  parallel  platforms.  It  is  important  to 
recognize  the  favorable  characteristics  of  these  benchmarks,  since  a  number  of 
real  applications,  for  example,  sparse  solvers  and  molecular  dynamics  methods, 
do  not  have  significant  data  reuse  and  the  underlying  computation  is  linear  in 
memory  requirement. 

5.3.7  Other  scalability  analysis  metrics 

We  have  alluded  to  constraints  on  runtime  and  memory  as  being  important 
determinants  of  scalability.  These  constraints  can  be  incorporated  directly  into 
scalability  metrics.  One  such  metric,  called  Scaled  Speedup,  increases  the  prob¬ 
lem  size  linearly  with  the  number  of  processing  cores  [41,  42,  43,  74,  80].  If  the 
scaled-speedup  curve  is  close  to  linear  with  respect  to  the  number  of  processing 
cores,  then  the  parallel  system  is  considered  scalable.  This  metric  is  related 
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to  isoefficiency  if  the  parallel  algorithm  under  consideration  has  linear  or  near- 
linear  isoefficiency  function.  In  this  case  the  scaled-speedup  metric  provides 
results  very  close  to  those  of  isoefficiency  analysis,  and  the  scaled-speedup  is 
linear  or  near-linear  with  respect  to  the  number  of  processing  cores.  For  par¬ 
allel  systems  with  much  worse  isoefficiencies,  the  results  provided  by  the  two 
metrics  may  be  quite  different.  In  this  case,  the  scaled-speedup  versus  number 
of  processing  cores  curve  is  sublinear. 

Two  generalized  notions  of  scaled  speedup  have  been  investigated.  They 
differ  in  the  methods  by  which  the  problem  size  is  scaled  up  with  the  number 
of  processing  elements.  In  one  method,  the  size  of  the  problem  is  increased  to 
fill  the  available  memory  on  the  parallel  computer.  The  assumption  here  is  that 
aggregate  memory  of  the  system  increases  with  the  number  of  processing  cores. 
In  the  other  method,  the  size  of  the  problem  grows  with  p  subject  to  a  bound 
on  parallel  execution  time.  We  illustrate  these  using  an  example: 

Memory  and  time-constrained  scaled  speedup  for  matrix- vector  prod¬ 
ucts.  The  serial  runtime  of  multiplying  a  matrix  of  dimension  n  x  n  with  a 
vector  is  n2  (with  normalized  unit  execution  time).  The  corresponding  parallel 
runtime  using  a  simple  parallel  algorithm  is  given  by: 

9 

n 

Tp  = - 1-  logp  +  n 

V 

and  the  speedup  S  is  given  by: 

n2 

S=  — -  (5.11) 

f+log  p  +  n 

The  total  memory  requirement  of  the  algorithm  is  @(n2).  Let  us  consider  the 
two  cases  of  problem  scaling.  In  the  case  of  memory  constrained  scaling,  we 
assume  that  the  memory  of  the  parallel  system  grows  linearly  with  the  number 
of  processing  cores,  i.e. ,  m  =  @(p).  This  is  a  reasonable  assumption  for  most 
current  parallel  platforms.  Since  m  =  Q(n2),  we  have  n2  =  c  x  p,  for  some 
constant  c.  Therefore,  the  scaled  speedup  S'  is  given  by: 

s,  =  _ exp _ 

^  +log  p+  ^/c  x  p 

or 

s,  = _ £i P _ 

C2  +  C3l0gp  +  C4y/p' 

In  the  limiting  case,  S'  =  0{^/p). 

In  the  case  of  time  constrained  scaling,  we  have  Tp  =  0(n2  /p),  and  since  this 
is  constrained  to  be  constant,  we  have  n2  =  0(p).  We  notice  that  this  case  is 
identical  to  the  memory  constrained  case.  This  happened  because  the  memory 
and  runtime  of  the  algorithm  are  asymptotically  identical. 
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Memory  and  time-constrained  scaled  speedup  for  matrix-matrix 
products.  The  serial  runtime  of  multiplying  two  matrices  of  dimension  n  x  n 
is  n3.  The  corresponding  parallel  runtime  using  a  simple  parallel  algorithm  is 
given  by: 

T  n3  ,  i  ,  n2 

TP  = - 1-  log  p  +  — 

P  Vp 

and  the  speedup  S  is  given  by: 


S  = 


f  +1°gP+7P 


(5.12) 


The  total  memory  requirement  of  the  algorithm  is  @(n2).  Let  us  consider  the 
two  cases  of  problem  scaling.  In  the  case  of  memory  constrained  scaling,  as 
before,  we  assume  that  the  memory  of  the  parallel  system  grows  linearly  with 
the  number  of  processing  elements,  i.e.,  m  =  O(p).  Since  m  =  @(n2),  we  have 
n2  =  c  x  p,  for  some  constant  c.  Therefore,  the  scaled  speedup  S'  is  given  by: 


S' = 


(c  x  p)1-5 

^^+logp  +  ^ 


0(p) 


In  the  case  of  time  constrained  scaling,  we  have  Tp  =  0(n3 /p).  Since  this 
is  constrained  to  be  constant,  n3  =  O(p),  or  n3  =  c  x  p  (for  some  constant  c). 
Therefore,  the  time-constrained  speedup  S"  is  given  by: 


S" 


exp 


c-f  +logp- 


(exp)2/3 

Vp 


o(p5/6) 


This  example  illustrates  that  memory-constrained  scaling  yields  linear  speedup, 
whereas  time-constrained  speedup  yields  sublinear  speedup  in  the  case  of  matrix 
multiplication. 


Serial  Fraction  /.  The  experimentally  determined  serial  fraction  /  can  be 
used  to  quantify  the  performance  of  a  parallel  system  on  a  fixed-size  problem. 
Consider  a  case  when  the  serial  runtime  of  a  computation  can  be  divided  into  a 
totally  parallel  and  a  totally  serial  component,  i.e., 

W  =  Tser  +  Tpar. 

Here,  Tser  and  Tpar  correspond  to  totally  serial  and  totally  parallel  components. 
From  this,  we  can  write: 

T 

rp  _  rr  ,  ±par 
-Ip  —  -L  ser  ~i 

p 

Here,  we  have  assumed  that  all  of  the  other  parallel  overheads  such  as  excess 
computation  and  communication  are  captured  in  the  serial  component  Tser. 
From  these  equations,  it  follows  that: 

Tp  =  Tser  +  W  ~  Tser  (5.13) 

P 
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The  serial  fraction  /  of  a  parallel  program  is  defined  as: 

_  Tser 

1  ~  W  ' 


Therefore,  from  Equation  5.13,  we  have: 

Tp  =  fxW  + 


W- f  x  IT 


£  =  f+^ 
W  J  p 


Since  S  =  W/Tp,  we  have 


Solving  for  /,  we  get: 


S  p 


l/s-l/p 


J  1-1  /p  ■  v  ' 

It  is  easy  to  see  that  smaller  values  of  /  are  better  since  they  result  in  higher 
efficiencies.  If  /  increases  with  the  number  of  processing  elements,  then  it  is 
an  indicator  of  rising  communication  overhead,  and  thus  an  indicator  of  poor 
scalability. 

Serial  component  of  the  matrix-vector  product.  From  Equations  5.14 
and  5.11,  we  have 

_2  , 

— +log  p+n 

f=^?vr  (5J5) 

Simplifying  the  above  expression,  we  get 

f  _  plogp  +  np  ^  1 

•1  2  ^  T 

nz  p  —  1 

^  log  p  +  n 

J  ~  O 

nz 

It  is  useful  to  note  that  the  denominator  of  this  equation  is  the  serial  runtime 
of  the  algorithm  and  the  numerator  corresponds  to  the  overhead  in  parallel 
execution. 

A  comprehensive  discussion  of  various  scalability  and  performance  measures 
can  be  found  in  the  survey  by  Kumar  and  Gupta  [50] .  While  over  ten  years  old, 
the  results  cited  in  this  survey  and  their  applications  are  particularly  relevant 
now,  as  scalable  parallel  platforms  are  finally  being  realized. 


5.4  Heterogeneous  Composition  of  Applications 

Many  applications  are  composed  of  multiple  steps,  with  differing  scaling  char¬ 
acteristics.  For  example,  in  commonly  used  molecular  dynamics  techniques,  the 
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potentials  on  particles  are  evaluated  as  sums  of  near  and  far-field  potentials. 
Near  field  potentials  consider  particles  in  a  localized  neighborhood,  typically 
resulting  in  a  linear  algorithmic  complexity.  Far-held  potentials,  along  with 
periodic  boundary  conditions  are  evaluated  using  Ewald  summations,  which 
involve  computation  of  an  FFT.  Short-range  potentials  can  be  evaluated  in  a 
scalable  manner  on  most  platforms,  since  they  have  favorable  communication  to 
computation  characteristics  (communication  is  required  only  for  particles  close 
to  the  periphery  of  a  processing  core’s  sub-domain).  FFTs,  on  the  other  hand 
have  more  exacting  requirement  on  network  bandwidth,  as  demonstrated  earlier 
in  this  chapter. 

Traditional  parallel  paradigms  attempt  to  distribute  computation  associated 
with  both  phases  across  all  processing  cores.  However,  since  FFTs  generally 
scale  worse  than  the  short-range  potentials,  it  is  possible  to  use  a  heterogeneous 
partition  of  the  platform  -  with  one  partition  working  on  short-range  potentials, 
and  the  other  on  computing  the  FFT.  The  results  from  the  two  are  combined 
to  determine  total  potential  (and  force)  at  each  particle. 

Scalability  analysis  is  a  critical  tool  in  understanding  and  developing  het¬ 
erogeneous  parallel  formulations.  The  tradeoffs  of  sizes  of  heterogeneous  par¬ 
titions,  algorithm  scaling,  and  serial  complexities  are  critical  determinants  of 
overall  performance. 
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Chapter  6 


Summary  and  Conclusion 


We  presented  an  effective  parallel  algorithm,  SPIKE,  for  solving  banded  linear 
systems.  Comparisons  against  the  corresponding  solvers  in  ScaLapack  prove 
that  our  algorithm  used  as  a  direct  solver  is  faster  than  ScaLapack,  and  if  one  is 
satisfied  with  reasonable  relative  residuals,  SPIKE  can  be  used  as  a  very  effec¬ 
tive  preconditioner  of  an  iterative  scheme  such  as  BiCGstab.  We  demonstrated 
the  effectiveness  of  SPIKE  on  varying  numbers  of  processors  with  increasing 
bandwidth,  as  well  as  its  favorable  scalability  on  a  large  number  of  processors 
on  the  IBM-SP.  Clearly,  in  deciding  on  the  original  partitioning  of  the  system, 
a  balance  must  be  achieved  between  the  computational  cost  of  solving  the  sub¬ 
problems  and  the  communication  overhead.  For  parallel  architectures  in  which 
the  communication  costs  are  relatively  high,  the  optimal  implementation  of 
the  SPIKE  algorithm  will  favor  fewer  partitions  to  reduce  the  communication 
overhead.  On  the  other  hand,  on  parallel  architectures  with  relatively  fast  inter¬ 
connects  and  in  which  each  CPU  has  vector  processing  capabilities,  the  SPIKE 
algorithm  can  be  further  modified  to  capitalize  on  trade-offs  between  parallel 
and  vector  processing. 

Three  members  of  the  SPIKE  family  of  schemes  have  been  described:  (i)  the 
Recursive  SPIKE,  (ii)  the  Truncated  SPIKE  and  (iii)  the  SPIKE  “on-the-fly” . 
Each  of  these  schemes  uses  a  different  solution  strategy  of  the  reduced  system. 
The  resulting  reduced  system  can  be  solved  either  explicitly  using  a  parallel 
recursive  method,  approximately,  or  implicitly  using  iterative  methods.  The 
Recursive  and  Truncated  versions  of  the  algorithm  for  handling  both  diagonally 
and  non-diagonally  dominant  systems,  have  been  implemented  and  compared 
with  the  corresponding  solvers  in  ScaLapack  for  systems  that  are  dense  within 
the  band.  The  speed  improvements  realized  by  SPIKE  over  ScaLapack  are 
significant.  The  favorable  scalability  of  SPIKE  has  also  been  demonstrated  on 
the  Intel-Xeon  cluster.  For  banded  systems  that  are  sparse  within  the  band, 
SPIKE  “on-the-fly”  with  two-levels  of  parallelism  exhibits  good  scalability. 

We  should  note  that  the  SPIKE  environment  is  specifically  well  suited  for 
recently  introduced  computational  fluid  dynamics  techniques  such  as  those  in 
[75],  [76],  or  [73].  In  [75]  the  finite  element  grid  is  partitioned  (statically  or 
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dynamically)  into  “iterative”  or  “direct”  groups,  depending  on  which  part  of  the 
grid  poses  a  challenge  to  iterative  schemes.  Also,  the  two-level  SPIKE  scheme 
is  ideally  suited  for  the  multi-scale  solution  technique  presented  in  [76].  Finally, 
any  member  of  the  SPIKE  family  can  be  used  for  handling  those  linear  systems 
that  arise  in  the  “solid-extension  mesh  moving  technique”,  in  [73],  in  regions 
of  the  grid  with  very  small  elements  which  are  at  risk  of  being  “entangled”  if 
classical  iterative  schemes  (with  “black-box”  preconditioners)  converge  slowly. 

We  have  also  demonstrated  that  banded  preconditioners  can  be  attractive 
alternatives  to  ILU-type  preconditioners,  which  might  be  desirable  particularly 
for  parallel  solvers.  Banded  preconditioners  offer  high  FLOP  counts  and  con¬ 
currency  in  addition  to  desirable  convergence  properties  for  a  range  of  problems. 
We  identified  the  central  role  of  matrix  reordering  in  the  performance  of  banded 
preconditioners.  We  proposed  the  use  of  weighted  spectral  ordering  for  this  pur¬ 
pose.  We  demonstrated  that,  weighted  bandwidth  reduction  based  on  spectral 
reordering,  used  in  conjunction  with  efficient  banded  solvers  is  capable  of  sig¬ 
nificantly  better  performance  than  even  optimized  versions  of  ILU  for  a  variety 
of  problems,  both  in  terms  of  convergence  and  time-to-solution. 

We  have  also  developed  a  parallel  hybrid  algorithm  for  the  solution  of  banded 
linear  systems  that  result  from  domain  decomposition.  The  conditions  that 
must  be  satisfied  to  guarantee  that  the  balance  system  is  symmetric  positive 
definite  or  nonsingular  are  also  stated.  Our  hybrid  algorithm  can  be  viewed  as 
solving  a  large  system  of  linear  equations  on  one  hand,  and  performing  the  LU- 
factorization  of  several  smaller  independent  linear  systems  once,  but  applying 
the  forward  and  backward  sweeps  several  times,  on  the  other.  We  describe  how 
the  banded  system  can  be  “torn”  into  overlapped  smaller  systems  that  can  be 
solved  independently  under  certain  constraints,  giving  rise  to  a  much  smaller 
balance  system  that  is  not  formed  explicitly.  Further,  we  showed  how  this  sys¬ 
tem  can  be  solved  via  a  modified  preconditioned  Krylov  subspace  method  with 
a  preconditioner  that  takes  advantage  of  the  decay  property  of  the  inverse  of 
banded  systems.  Finally,  numerical  experiments  show  that  our  proposed  algo¬ 
rithm  outperforms  sequential  LAPACK,  parallel  ScaLapack,  preconditioned  CG 
and  BiCGStab  for  symmetric  and  non-symmetric  banded  systems,  respectively. 

Finally,  we  have  demonstrated  the  power  of  scalability  analysis,  and  more 
generally,  analytical  modeling.  It  is  not  always  possible  to  do  such  modeling 
under  tight  bounds.  Analytical  modeling  must  account  for  finite  resources  and 
parameters  such  as  line  sizes,  replacement  policies,  and  other  related  intricacies. 
Clearly  such  analysis,  if  performed  precisely,  becomes  specific  to  problem  in¬ 
stance  and  platform,  and  does  not  generalize.  While  aforementioned  limitations 
apply  specifically  to  small-to-moderate  sized  systems,  larger  aggregates  (teras- 
cale  and  beyond)  are  generally  programmed  through  explicit  message  transfers. 
This  is  accomplished  either  through  MPI-style  messaging,  or  explicit  (put-get) 
or  implicit  (partitioned  global  arrays)  one-way  primitives.  In  such  cases,  even 
with  loose  bounds  on  performance  of  individual  multicore  processors,  we  can 
establish  tight  bounds  on  scalability  of  the  complete  parallel  system.  Analytical 
modeling  and  scalability  analysis  can  be  reliably  applied  to  such  systems  as  well. 
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